:mod:`parser` --- Access Python parse trees
===========================================

.. module:: parser
   :synopsis: Access parse trees for Python source code.
.. moduleauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>


.. Copyright 1995 Virginia Polytechnic Institute and State University and Fred
   L. Drake, Jr. This copyright notice must be distributed on all copies, but
   this document otherwise may be distributed as part of the Python
   distribution. No fee may be charged for this document in any representation,
   either on paper or electronically. This restriction does not affect other
   elements in a distributed package in any way.

.. index:: single: parsing; Python source code

The :mod:`parser` module provides an interface to Python's internal parser and
byte-code compiler. The primary purpose for this interface is to allow Python
code to edit the parse tree of a Python expression and create executable code
from this. This is better than trying to parse and modify an arbitrary Python
code fragment as a string because parsing is performed in a manner identical to
the code forming the application. It is also faster.

.. note::

   From Python 2.5 onward, it's much more convenient to cut in at the Abstract
   Syntax Tree (AST) generation and compilation stage, using the :mod:`ast`
   module.

   The :mod:`parser` module exports the names documented here also with "st"
   replaced by "ast"; this is a legacy from the time when there was no other
   AST and has nothing to do with the AST found in Python 2.5. This is also the
   reason for the functions' keyword arguments being called *ast*, not *st*.

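As a rough illustration of the alternative the note recommends, the following
sketch goes through the :mod:`ast` module rather than :mod:`parser`. It assumes
a Python version in which :func:`ast.parse` accepts a *mode* argument; the
expression and the file name ``'file.py'`` are placeholders chosen for this
example only.

```python
import ast

# Parse an expression into an AST (this is not a parser-module ST object).
tree = ast.parse("a + 5", filename="file.py", mode="eval")

# The AST could be inspected or modified here before compilation.
code = compile(tree, "file.py", "eval")

# Evaluate with an explicit namespace that supplies the free variable.
result = eval(code, {"a": 5})
print(result)
```

The tree produced here is an abstract syntax tree, so it does not have the
nested-sequence layout described in the rest of this document.
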
There are a few things to note about this module which are important to making
use of the data structures created. This is not a tutorial on editing the parse
trees for Python code, but some examples of using the :mod:`parser` module are
presented.

Most importantly, a good understanding of the Python grammar processed by the
internal parser is required. For full information on the language syntax, refer
to :ref:`reference-index`. The parser
itself is created from a grammar specification defined in the file
:file:`Grammar/Grammar` in the standard Python distribution. The parse trees
stored in the ST objects created by this module are the actual output from the
internal parser when created by the :func:`expr` or :func:`suite` functions,
described below. The ST objects created by :func:`sequence2st` faithfully
simulate those structures. Be aware that the values of the sequences which are
considered "correct" will vary from one version of Python to another as the
formal grammar for the language is revised. However, transporting code from one
Python version to another as source text will always allow correct parse trees
to be created in the target version, with the only restriction being that
migrating to an older version of the interpreter will not support more recent
language constructs. The parse trees are not typically compatible from one
version to another, whereas source code has always been forward-compatible.

Each element of the sequences returned by :func:`st2list` or :func:`st2tuple`
has a simple form. Sequences representing non-terminal elements in the grammar
always have a length greater than one. The first element is an integer which
identifies a production in the grammar. These integers are given symbolic names
in the C header file :file:`Include/graminit.h` and the Python module
:mod:`symbol`. Each additional element of the sequence represents a component
of the production as recognized in the input string: these are always sequences
which have the same form as the parent. An important aspect of this structure
which should be noted is that keywords used to identify the parent node type,
such as the keyword :keyword:`if` in an :const:`if_stmt`, are included in the
node tree without any special treatment. For example, the :keyword:`if` keyword
is represented by the tuple ``(1, 'if')``, where ``1`` is the numeric value
associated with all :const:`NAME` tokens, including variable and function names
defined by the user. In an alternate form returned when line number information
is requested, the same token might be represented as ``(1, 'if', 12)``, where
the ``12`` represents the line number at which the terminal symbol was found.

Terminal elements are represented in much the same way, but without any child
elements and the addition of the source text which was identified. The example
of the :keyword:`if` keyword above is representative. The various types of
terminal symbols are defined in the C header file :file:`Include/token.h` and
the Python module :mod:`token`.

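The layout described above can be walked with a few lines of code. The sketch
below is not part of the module; ``iter_terminals`` and the sample tree are
hypothetical, and the production numbers ``900`` and ``901`` are made up for
illustration. It relies only on the rule stated above: a terminal carries its
source text as the second element, while a non-terminal carries child
sequences.

```python
def iter_terminals(node):
    """Yield (token_type, text, line_or_None) for each terminal under node."""
    if len(node) >= 2 and isinstance(node[1], str):
        # Terminal: (type, text) or (type, text, line_number).
        line = node[2] if len(node) > 2 else None
        yield node[0], node[1], line
    else:
        # Non-terminal: (production_id, child, child, ...).
        for child in node[1:]:
            for item in iter_terminals(child):
                yield item

# Hand-written tree; 900 and 901 are made-up production numbers.
SAMPLE = (900,
          (1, 'if', 12),
          (901, (1, 'x', 12)),
          (11, ':', 12))

terminals = list(iter_terminals(SAMPLE))
print(terminals)
```

Because the function only looks at the shape of each sequence, it works on
both the list and the tuple form, with or without line numbers.
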
The ST objects are not required to support the functionality of this module,
but are provided for three purposes: to allow an application to amortize the
cost of processing complex parse trees, to provide a parse tree representation
which conserves memory space when compared to the Python list or tuple
representation, and to ease the creation of additional modules in C which
manipulate parse trees. A simple "wrapper" class may be created in Python to
hide the use of ST objects.

The :mod:`parser` module defines functions for a few distinct purposes. The
most important purposes are to create ST objects and to convert ST objects to
other representations such as parse trees and compiled code objects, but there
are also functions which serve to query the type of parse tree represented by an
ST object.


.. seealso::

   Module :mod:`symbol`
      Useful constants representing internal nodes of the parse tree.

   Module :mod:`token`
      Useful constants representing leaf nodes of the parse tree and functions for
      testing node values.


.. _creating-sts:

Creating ST Objects
-------------------

ST objects may be created from source code or from a parse tree. When creating
an ST object from source, different functions are used to create the ``'eval'``
and ``'exec'`` forms.


.. function:: expr(source)

   The :func:`expr` function parses the parameter *source* as if it were an input
   to ``compile(source, 'file.py', 'eval')``. If the parse succeeds, an ST object
   is created to hold the internal parse tree representation; otherwise, an
   appropriate exception is raised.


.. function:: suite(source)

   The :func:`suite` function parses the parameter *source* as if it were an input
   to ``compile(source, 'file.py', 'exec')``. If the parse succeeds, an ST object
   is created to hold the internal parse tree representation; otherwise, an
   appropriate exception is raised.


.. function:: sequence2st(sequence)

   This function accepts a parse tree represented as a sequence and builds an
   internal representation if possible. If it can validate that the tree conforms
   to the Python grammar and all nodes are valid node types in the host version of
   Python, an ST object is created from the internal representation and returned
   to the caller. If there is a problem creating the internal representation, or
   if the tree cannot be validated, a :exc:`ParserError` exception is raised. An
   ST object created this way should not be assumed to compile correctly; normal
   exceptions raised by compilation may still be initiated when the ST object is
   passed to :func:`compilest`. This may indicate problems not related to syntax
   (such as a :exc:`MemoryError` exception), but may also be due to constructs such
   as the result of parsing ``del f(0)``, which escapes the Python parser but is
   checked by the bytecode compiler.

   Sequences representing terminal tokens may be represented as either two-element
   sequences of the form ``(1, 'name')`` or as three-element sequences of the form
   ``(1, 'name', 56)``. If the third element is present, it is assumed to be a valid
   line number. The line number may be specified for any subset of the terminal
   symbols in the input tree.


.. function:: tuple2st(sequence)

   This is the same function as :func:`sequence2st`. This entry point is
   maintained for backward compatibility.


.. _converting-sts:

Converting ST Objects
---------------------

ST objects, regardless of the input used to create them, may be converted to
parse trees represented as list- or tuple- trees, or may be compiled into
executable code objects. Parse trees may be extracted with or without line
numbering information.


.. function:: st2list(ast[, line_info])

   This function accepts an ST object from the caller in *ast* and returns a
   Python list representing the equivalent parse tree. The resulting list
   representation can be used for inspection or the creation of a new parse tree in
   list form. This function does not fail so long as memory is available to build
   the list representation. If the parse tree will only be used for inspection,
   :func:`st2tuple` should be used instead to reduce memory consumption and
   fragmentation. When the list representation is required, this function is
   significantly faster than retrieving a tuple representation and converting that
   to nested lists.

   If *line_info* is true, line number information will be included for all
   terminal tokens as a third element of the list representing the token. Note
   that the line number provided specifies the line on which the token *ends*.
   This information is omitted if the flag is false or omitted.


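For comparison, the conversion from a tuple tree to nested lists that
:func:`st2list` makes unnecessary can be sketched in pure Python. The helper
name ``tuple_tree_to_lists`` and the sample tree are hypothetical, and the
node number ``900`` is made up for illustration.

```python
def tuple_tree_to_lists(node):
    """Return a copy of a tuple-form tree with every tuple turned into a list."""
    if isinstance(node, tuple):
        return [tuple_tree_to_lists(child) for child in node]
    return node  # integers and strings are returned unchanged

# Hand-written tree; 900 is a made-up production number.
tuple_tree = (900, (1, 'a'), (14, '+'), (2, '5'))
list_tree = tuple_tree_to_lists(tuple_tree)

# Lists can be edited in place, which the tuple form does not allow.
list_tree[3][1] = '6'
print(list_tree)
```

The recursion visits every node once in interpreted code, which is the extra
work the paragraph above advises avoiding when a list is what is needed.
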
.. function:: st2tuple(ast[, line_info])

   This function accepts an ST object from the caller in *ast* and returns a
   Python tuple representing the equivalent parse tree. Other than returning a
   tuple instead of a list, this function is identical to :func:`st2list`.

   If *line_info* is true, line number information will be included for all
   terminal tokens as a third element of the tuple representing the token. This
   information is omitted if the flag is false or omitted.


.. function:: compilest(ast[, filename='<syntax-tree>'])

   .. index:: builtin: eval

   The Python byte compiler can be invoked on an ST object to produce code objects
   which can be used as part of an :keyword:`exec` statement or a call to the
   built-in :func:`eval` function. This function provides the interface to the
   compiler, passing the internal parse tree from *ast* to the compiler, using the
   source file name specified by the *filename* parameter. The default value
   supplied for *filename* indicates that the source was an ST object.

   Compiling an ST object may result in exceptions related to compilation; an
   example would be a :exc:`SyntaxError` caused by the parse tree for ``del f(0)``:
   this statement is considered legal within the formal grammar for Python but is
   not a legal language construct. The :exc:`SyntaxError` raised for this
   condition is actually generated by the Python byte-compiler normally, which is
   why it can be raised at this point by the :mod:`parser` module. Most causes of
   compilation failure can be diagnosed programmatically by inspection of the parse
   tree.


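The same rejection can be observed without the :mod:`parser` module by handing
the statement to the built-in :func:`compile` function. This is only a sketch:
which internal stage reports the error may differ between Python versions, so
the code checks that a :exc:`SyntaxError` is raised and makes no assumption
about the wording of its message.

```python
source = "del f(0)"

try:
    compile(source, "<string>", "exec")
except SyntaxError as exc:
    # The construct is refused before any code object is produced.
    rejected = True
    reason = exc.msg
else:
    rejected = False
    reason = None

print(rejected, reason)
```
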
.. _querying-sts:

Queries on ST Objects
---------------------

Two functions are provided which allow an application to determine if an ST was
created as an expression or a suite. Neither of these functions can be used to
determine if an ST was created from source code via :func:`expr` or
:func:`suite` or from a parse tree via :func:`sequence2st`.


.. function:: isexpr(ast)

   .. index:: builtin: compile

   When *ast* represents an ``'eval'`` form, this function returns true, otherwise
   it returns false. This is useful, since code objects normally cannot be queried
   for this information using existing built-in functions. Note that the code
   objects created by :func:`compilest` cannot be queried like this either, and
   are identical to those created by the built-in :func:`compile` function.


.. function:: issuite(ast)

   This function mirrors :func:`isexpr` in that it reports whether an ST object
   represents an ``'exec'`` form, commonly known as a "suite." It is not safe to
   assume that this function is equivalent to ``not isexpr(ast)``, as additional
   syntactic fragments may be supported in the future.


.. _st-errors:

Exceptions and Error Handling
-----------------------------

The parser module defines a single exception, but may also pass other built-in
exceptions from other portions of the Python runtime environment. See each
function for information about the exceptions it can raise.


.. exception:: ParserError

   Exception raised when a failure occurs within the parser module. This is
   generally produced for validation failures rather than the built-in
   :exc:`SyntaxError` raised during normal parsing. The exception argument is
   either a string describing the reason for the failure or a tuple containing a
   sequence causing the failure from a parse tree passed to :func:`sequence2st`
   and an explanatory string. Calls to :func:`sequence2st` need to be able to
   handle either form of argument, while calls to other functions in the module
   will only need to be aware of the simple string values.

Note that the functions :func:`compilest`, :func:`expr`, and :func:`suite` may
raise exceptions which are normally raised by the parsing and compilation
process. These include the built-in exceptions :exc:`MemoryError`,
:exc:`OverflowError`, :exc:`SyntaxError`, and :exc:`SystemError`. In these
cases, these exceptions carry all the meaning normally associated with them.
Refer to the descriptions of each function for detailed information.


.. _st-objects:

ST Objects
----------

Ordered and equality comparisons are supported between ST objects. Pickling of
ST objects (using the :mod:`pickle` module) is also supported.


.. data:: STType

   The type of the objects returned by :func:`expr`, :func:`suite` and
   :func:`sequence2st`.

ST objects have the following methods:


.. method:: ST.compile([filename])

   Same as ``compilest(st, filename)``.


.. method:: ST.isexpr()

   Same as ``isexpr(st)``.


.. method:: ST.issuite()

   Same as ``issuite(st)``.


.. method:: ST.tolist([line_info])

   Same as ``st2list(st, line_info)``.


.. method:: ST.totuple([line_info])

   Same as ``st2tuple(st, line_info)``.


.. _st-examples:

Examples
--------

.. index:: builtin: compile

The :mod:`parser` module allows operations to be performed on the parse tree of
Python source code before the :term:`bytecode` is generated, and provides for
inspection of the parse tree for information gathering purposes. Two examples
are presented. The simple example demonstrates emulation of the :func:`compile`
built-in function and the complex example shows the use of a parse tree for
information discovery.


Emulation of :func:`compile`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While many useful operations may take place between parsing and bytecode
generation, the simplest operation is to do nothing. For this purpose, using
the :mod:`parser` module to produce an intermediate data structure is equivalent
to the code ::

   >>> code = compile('a + 5', 'file.py', 'eval')
   >>> a = 5
   >>> eval(code)
   10

The equivalent operation using the :mod:`parser` module is somewhat longer, and
allows the intermediate internal parse tree to be retained as an ST object::

   >>> import parser
   >>> st = parser.expr('a + 5')
   >>> code = st.compile('file.py')
   >>> a = 5
   >>> eval(code)
   10

An application which needs both ST and code objects can package this code into
readily available functions::

   import parser

   def load_suite(source_string):
       st = parser.suite(source_string)
       return st, st.compile()

   def load_expression(source_string):
       st = parser.expr(source_string)
       return st, st.compile()


Information Discovery
^^^^^^^^^^^^^^^^^^^^^

.. index::
   single: string; documentation
   single: docstrings

Some applications benefit from direct access to the parse tree. The remainder
of this section demonstrates how the parse tree provides access to module
documentation defined in docstrings without requiring that the code being
examined be loaded into a running interpreter via :keyword:`import`. This can
be very useful for performing analyses of untrusted code.

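For readers who only need the docstrings and not the parse tree itself, the
:mod:`ast` module mentioned in the note at the top of this document offers a
shorter route that likewise never imports the code being examined. The
following is a minimal sketch under that assumption; the source text is a
made-up stand-in for a file read from disk.

```python
import ast

# Stand-in for the contents of a file; nothing here is imported or executed.
SOURCE = "\n".join([
    '"""Some documentation.',
    '"""',
    '',
    'def f():',
    '    "Docstring of f."',
    '',
])

module = ast.parse(SOURCE)
module_doc = ast.get_docstring(module)

# Collect docstrings of functions defined at the top level of the module.
function_docs = {}
for node in module.body:
    if isinstance(node, ast.FunctionDef):
        function_docs[node.name] = ast.get_docstring(node)

print(module_doc, function_docs)
```

The rest of this section works on the concrete parse tree instead, which is
what the :mod:`parser` module exposes.
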
Generally, the example will demonstrate how the parse tree may be traversed to
distill interesting information. Two functions and a set of classes are
developed which provide programmatic access to high level function and class
definitions provided by a module. The classes extract information from the
parse tree and provide access to the information at a useful semantic level, one
function provides a simple low-level pattern matching capability, and the other
function defines a high-level interface to the classes by handling file
operations on behalf of the caller. All source files mentioned here which are
not part of the Python installation are located in the :file:`Demo/parser/`
directory of the distribution.

The dynamic nature of Python allows the programmer a great deal of flexibility,
but most modules need only a limited measure of this when defining classes,
functions, and methods. In this example, the only definitions that will be
considered are those which are defined in the top level of their context, e.g.,
a function defined by a :keyword:`def` statement at column zero of a module, but
not a function defined within a branch of an :keyword:`if` ... :keyword:`else`
construct, though there are some good reasons for doing so in some situations.
Nesting of definitions will be handled by the code developed in the example.

To construct the upper-level extraction methods, we need to know what the parse
tree structure looks like and how much of it we actually need to be concerned
about. Python uses a moderately deep parse tree so there are a large number of
intermediate nodes. It is important to read and understand the formal grammar
used by Python. This is specified in the file :file:`Grammar/Grammar` in the
distribution. Consider the simplest case of interest when searching for
docstrings: a module consisting of a docstring and nothing else. (See file
:file:`docstring.py`.) ::

   """Some documentation.
   """

Using the interpreter to take a look at the parse tree, we find a bewildering
mass of numbers and parentheses, with the documentation buried deep in nested
tuples. ::

   >>> import parser
   >>> import pprint
   >>> st = parser.suite(open('docstring.py').read())
   >>> tup = st.totuple()
   >>> pprint.pprint(tup)
   (257,
    (264,
     (265,
      (266,
       (267,
        (307,
         (287,
          (288,
           (289,
            (290,
             (292,
              (293,
               (294,
                (295,
                 (296,
                  (297,
                   (298,
                    (299,
                     (300, (3, '"""Some documentation.\n"""'))))))))))))))))),
      (4, ''))),
    (4, ''),
    (0, ''))

The numbers at the first element of each node in the tree are the node types;
they map directly to terminal and non-terminal symbols in the grammar.
Unfortunately, they are represented as integers in the internal representation,
and the Python structures generated do not change that. However, the
:mod:`symbol` and :mod:`token` modules provide symbolic names for the node types
and dictionaries which map from the integers to the symbolic names for the node
types.

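As a small illustration of those dictionaries, the sketch below replaces the
type numbers of terminal nodes with their names from ``token.tok_name``. The
helper ``name_terminals`` is hypothetical. Non-terminal numbers are left
untouched to keep the example short; where the :mod:`symbol` module is
available, its ``sym_name`` dictionary could be consulted in the same way.

```python
import token

def name_terminals(node):
    """Return a copy of a tuple tree with terminal type numbers replaced by names."""
    if len(node) >= 2 and isinstance(node[1], str):
        # Terminal: look the number up, falling back to the number itself.
        return (token.tok_name.get(node[0], node[0]),) + tuple(node[1:])
    # Non-terminal: keep the number and convert the children.
    return (node[0],) + tuple(name_terminals(child) for child in node[1:])

# The outermost node of the example tree with its statement subtree omitted.
tree = (257, (4, ''), (0, ''))
named = name_terminals(tree)
print(named)
```
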
In the output presented above, the outermost tuple contains four elements: the
integer ``257`` and three additional tuples. Node type ``257`` has the symbolic
name :const:`file_input`. Each of these inner tuples contains an integer as the
first element; these integers, ``264``, ``4``, and ``0``, represent the node
types :const:`stmt`, :const:`NEWLINE`, and :const:`ENDMARKER`, respectively.
Note that these values may change depending on the version of Python you are
using; consult :file:`symbol.py` and :file:`token.py` for details of the
mapping. It should be fairly clear that the outermost node is related primarily
to the input source rather than the contents of the file, and may be disregarded
for the moment. The :const:`stmt` node is much more interesting. In
particular, all docstrings are found in subtrees which are formed exactly as
this node is formed, with the only difference being the string itself. The
association between the docstring in a similar tree and the defined entity
(class, function, or module) which it describes is given by the position of the
docstring subtree within the tree defining the described structure.

By replacing the actual docstring with something to signify a variable component
of the tree, we allow a simple pattern matching approach to check any given
subtree for equivalence to the general pattern for docstrings. Since the
example demonstrates information extraction, we can safely require that the tree
be in tuple form rather than list form, allowing a simple variable
representation to be ``['variable_name']``. A simple recursive function can
implement the pattern matching, returning a Boolean and a dictionary of variable
name to value mappings. (See file :file:`example.py`.) ::

   from types import ListType, TupleType

   def match(pattern, data, vars=None):
       if vars is None:
           vars = {}
       if type(pattern) is ListType:
           vars[pattern[0]] = data
           return 1, vars
       if type(pattern) is not TupleType:
           return (pattern == data), vars
       if len(data) != len(pattern):
           return 0, vars
       for pattern, data in map(None, pattern, data):
           same, vars = match(pattern, data, vars)
           if not same:
               break
       return same, vars

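The listing above relies on ``types.ListType`` and ``map(None, ...)``, which
later Python versions no longer provide. The following is a sketch of the same
idea written so that it also runs there; it is not the code from
:file:`example.py`. It renames *vars* to *variables*, returns ``True`` and
``False`` rather than ``1`` and ``0``, and adds a check that *data* is a tuple
before its length is compared. The pattern and the node number ``900`` are made
up for illustration.

```python
def match(pattern, data, variables=None):
    """Match a tuple-form tree against a pattern; return (matched, variables)."""
    if variables is None:
        variables = {}
    if isinstance(pattern, list):
        # A one-element list names a variable that captures this subtree.
        variables[pattern[0]] = data
        return True, variables
    if not isinstance(pattern, tuple):
        # Integers and strings must compare equal.
        return pattern == data, variables
    if not isinstance(data, tuple) or len(data) != len(pattern):
        return False, variables
    for sub_pattern, sub_data in zip(pattern, data):
        same, variables = match(sub_pattern, sub_data, variables)
        if not same:
            return False, variables
    return True, variables

# Made-up pattern: node 900 holding a STRING token and a NEWLINE token.
PATTERN = (900, (3, ['docstring']), (4, ''))
found, captured = match(PATTERN, (900, (3, '"""Doc."""'), (4, '')))
missed, _ = match(PATTERN, (900, (1, 'x'), (4, '')))
print(found, captured, missed)
```
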
Using this simple representation for syntactic variables and the symbolic node
types, the pattern for the candidate docstring subtrees becomes fairly readable.
(See file :file:`example.py`.) ::

   import symbol
   import token

   DOCSTRING_STMT_PATTERN = (
       symbol.stmt,
       (symbol.simple_stmt,
        (symbol.small_stmt,
         (symbol.expr_stmt,
          (symbol.testlist,
           (symbol.test,
            (symbol.and_test,
             (symbol.not_test,
              (symbol.comparison,
               (symbol.expr,
                (symbol.xor_expr,
                 (symbol.and_expr,
                  (symbol.shift_expr,
                   (symbol.arith_expr,
                    (symbol.term,
                     (symbol.factor,
                      (symbol.power,
                       (symbol.atom,
                        (token.STRING, ['docstring'])
                        )))))))))))))))),
        (token.NEWLINE, '')
        ))

Using the :func:`match` function with this pattern, extracting the module
docstring from the parse tree created previously is easy::

   >>> found, vars = match(DOCSTRING_STMT_PATTERN, tup[1])
   >>> found
   1
   >>> vars
   {'docstring': '"""Some documentation.\n"""'}

Once specific data can be extracted from a location where it is expected, the
question of where information can be expected needs to be answered. When
dealing with docstrings, the answer is fairly simple: the docstring is the first
:const:`stmt` node in a code block (:const:`file_input` or :const:`suite` node
types). A module consists of a single :const:`file_input` node, and class and
function definitions each contain exactly one :const:`suite` node. Classes and
functions are readily identified as subtrees of code block nodes which start
with ``(stmt, (compound_stmt, (classdef, ...`` or ``(stmt, (compound_stmt,
(funcdef, ...``. Note that these subtrees cannot be matched by :func:`match`
since it does not support multiple sibling nodes to match without regard to
number. A more elaborate matching function could be used to overcome this
limitation, but this is sufficient for the example.

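One way to recognize such subtrees without a fixed-length pattern is to test
only the leading node types and accept any number of trailing children. The
sketch below is hypothetical and is not taken from :file:`example.py`; the four
node numbers are made up, and in real use they would be read from the
:mod:`symbol` module.

```python
# Made-up node numbers for illustration only.
STMT, COMPOUND_STMT, FUNCDEF, CLASSDEF = 900, 901, 902, 903

def definition_subtree(node):
    """Return the funcdef or classdef subtree of a stmt node, or None."""
    if node[0] != STMT or len(node) < 2:
        return None
    compound = node[1]
    if compound[0] != COMPOUND_STMT or len(compound) < 2:
        return None
    inner = compound[1]
    if inner[0] in (FUNCDEF, CLASSDEF):
        return inner  # any number of children is accepted here
    return None

func_stmt = (STMT,
             (COMPOUND_STMT,
              (FUNCDEF, (1, 'def'), (1, 'spam'), (11, ':'))))
other_stmt = (STMT, (904, (1, 'pass')))

found = definition_subtree(func_stmt)
print(found[2][1], definition_subtree(other_stmt))
```

In this sample the defined name sits at index ``[2][1]`` of the returned
subtree, the second child after the :keyword:`def` keyword.
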
Given the ability to determine whether a statement might be a docstring and
extract the actual string from the statement, some work needs to be performed to
walk the parse tree for an entire module and extract information about the names
defined in each context of the module and associate any docstrings with the
names. The code to perform this work is not complicated, but bears some
explanation.

The public interface to the classes is straightforward and should probably be
somewhat more flexible. Each "major" block of the module is described by an
object providing several methods for inquiry and a constructor which accepts at
least the subtree of the complete parse tree which it represents. The
:class:`ModuleInfo` constructor accepts an optional *name* parameter since it
cannot otherwise determine the name of the module.

The public classes include :class:`ClassInfo`, :class:`FunctionInfo`, and
:class:`ModuleInfo`. All objects provide the methods :meth:`get_name`,
:meth:`get_docstring`, :meth:`get_class_names`, and :meth:`get_class_info`. The
:class:`ClassInfo` objects support :meth:`get_method_names` and
:meth:`get_method_info` while the other classes provide
:meth:`get_function_names` and :meth:`get_function_info`.

For every kind of code block that the public classes represent, most of the
required information has the same form and is accessed in the same way. The one
distinction is that functions defined at the top level of a class are referred
to as "methods." Since this difference in nomenclature reflects a real semantic
distinction from functions defined outside of a class, the implementation needs
to maintain it. Hence, most of the functionality of the public classes can be
implemented in a common base class, :class:`SuiteInfoBase`, with the accessors
for function and method information provided elsewhere. Note that there is only
one class which represents function and method information; this parallels the
use of the :keyword:`def` statement to define both types of elements.

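One way the accessors described above could be arranged is sketched below.
This is an illustration only, not a quotation from :file:`example.py`: the
constructors omit all parse-tree handling, the mixin name
``FunctionAccessors`` is hypothetical, and the name lists are sorted purely to
make the output predictable.

```python
class SuiteInfoBase:
    """Accessors shared by every kind of code block (no parse-tree handling)."""
    _docstring = ''
    _name = ''

    def __init__(self):
        self._class_info = {}
        self._function_info = {}

    def get_name(self):
        return self._name

    def get_docstring(self):
        return self._docstring

    def get_class_names(self):
        return sorted(self._class_info)

    def get_class_info(self, name):
        return self._class_info[name]


class FunctionAccessors:
    """Hypothetical mixin: nested definitions are reported as functions."""

    def get_function_names(self):
        return sorted(self._function_info)

    def get_function_info(self, name):
        return self._function_info[name]


class FunctionInfo(SuiteInfoBase, FunctionAccessors):
    pass


class ModuleInfo(SuiteInfoBase, FunctionAccessors):
    pass


class ClassInfo(SuiteInfoBase):
    """Stores the same kind of objects, but reports them as methods."""

    def get_method_names(self):
        return sorted(self._function_info)

    def get_method_info(self, name):
        return self._function_info[name]


# Populate one class by hand, since this sketch does no parsing.
shape = ClassInfo()
shape._function_info['area'] = FunctionInfo()
method_names = shape.get_method_names()
```

A single :class:`FunctionInfo` class serves for both functions and methods;
only the accessor names on the containing object differ.
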
Most of the accessor functions are declared in :class:`SuiteInfoBase` and do not
need to be overridden by subclasses. More importantly, the extraction of most
information from a parse tree is handled through a method called by the
:class:`SuiteInfoBase` constructor. The example code for most of the classes is
clear when read alongside the formal grammar, but the method which recursively
creates new information objects requires further examination. Here is the
relevant part of the :class:`SuiteInfoBase` definition from :file:`example.py`::

   class SuiteInfoBase:
       _docstring = ''
       _name = ''

       def __init__(self, tree = None):
           self._class_info = {}
           self._function_info = {}
           if tree:
               self._extract_info(tree)

       def _extract_info(self, tree):
           # extract docstring
           if len(tree) == 2:
               found, vars = match(DOCSTRING_STMT_PATTERN[1], tree[1])
           else:
               found, vars = match(DOCSTRING_STMT_PATTERN, tree[3])
           if found:
               self._docstring = eval(vars['docstring'])
           # discover inner definitions
           for node in tree[1:]:
               found, vars = match(COMPOUND_STMT_PATTERN, node)
               if found:
                   cstmt = vars['compound']
                   if cstmt[0] == symbol.funcdef:
                       name = cstmt[2][1]
                       self._function_info[name] = FunctionInfo(cstmt)
                   elif cstmt[0] == symbol.classdef:
                       name = cstmt[2][1]
                       self._class_info[name] = ClassInfo(cstmt)

After initializing some internal state, the constructor calls the
:meth:`_extract_info` method. This method performs the bulk of the information
extraction in the entire example. The extraction has two distinct phases:
locating the docstring for the parse tree passed in, and discovering additional
definitions within the code block represented by the parse tree.

The initial :keyword:`if` test determines whether the nested suite is of the
"short form" or the "long form." The short form is used when the code block is
on the same line as the definition of the code block, as in ::

   def square(x): "Square an argument."; return x ** 2

while the long form uses an indented block and allows nested definitions::

   def make_power(exp):
       "Make a function that raises an argument to the exponent `exp'."
       def raiser(x, y=exp):
           return x ** y
       return raiser

When the short form is used, the code block may contain a docstring as the
first, and possibly only, :const:`small_stmt` element. The extraction of such a
docstring is slightly different and requires only a portion of the complete
pattern used in the more common case. As implemented, the docstring is found
only if the :const:`simple_stmt` node contains exactly one :const:`small_stmt`
node. Since most functions and methods which use the short form do not provide
a docstring, this may be considered sufficient. The extraction of the docstring
proceeds using the :func:`match` function as described above, and the value of
the docstring is stored as an attribute of the :class:`SuiteInfoBase` object.

After docstring extraction, a simple definition discovery algorithm operates on
the :const:`stmt` nodes of the :const:`suite` node. The special case of the
short form is not tested; since there are no :const:`stmt` nodes in the short
form, the algorithm will silently skip the single :const:`simple_stmt` node and
correctly not discover any nested definitions.

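The two suite shapes, and the way both phases treat them, can be illustrated
with hand-written tuples. The trees below are heavily abbreviated and
hypothetical: string labels stand in for the numeric node types, and the helper
names ``docstring_candidate`` and ``nested_statements`` do not appear in
:file:`example.py`.

```python
# Short form: the suite holds a single simple_stmt directly.
SHORT_SUITE = (
    'suite',
    ('simple_stmt',
     ('small_stmt', ('STRING', '"Square an argument."')),
     ('NEWLINE', '')),
)

# Long form: NEWLINE and INDENT come first, then one stmt node per statement.
LONG_SUITE = (
    'suite',
    ('NEWLINE', ''),
    ('INDENT', ''),
    ('stmt', ('simple_stmt',
              ('small_stmt', ('STRING', '"Make a function."')),
              ('NEWLINE', ''))),
    ('stmt', ('compound_stmt', ('funcdef', ('NAME', 'def'),
                                ('NAME', 'raiser')))),
    ('DEDENT', ''),
)


def docstring_candidate(suite):
    """Pick the node that may hold the docstring, as the if test does."""
    if len(suite) == 2:
        return suite[1]   # short form: the lone simple_stmt
    return suite[3]       # long form: first stmt after NEWLINE and INDENT


def nested_statements(suite):
    """Return only the stmt children; everything else is skipped."""
    return [node for node in suite[1:] if node[0] == 'stmt']


short_candidate = docstring_candidate(SHORT_SUITE)
long_candidate = docstring_candidate(LONG_SUITE)
```

Because the short form yields a :const:`simple_stmt` rather than a
:const:`stmt`, only the inner part of the docstring pattern applies to it, and
the discovery loop finds nothing to examine.
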
Each statement in the code block is categorized as a class definition, function
or method definition, or something else. For the definition statements, the
name of the element defined is extracted and a representation object appropriate
to the definition is created with the defining subtree passed as an argument to
the constructor. The representation objects are stored in instance variables
and may be retrieved by name using the appropriate accessor methods.

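The categorization step can be sketched on its own, again using abbreviated
tuples with string labels in place of ``symbol.funcdef`` and
``symbol.classdef``. The helper names ``classify`` and ``collect`` are
hypothetical, and the subtrees are stored directly rather than wrapped in
:class:`FunctionInfo` or :class:`ClassInfo` objects.

```python
def classify(cstmt):
    """Return (kind, name) for a definition subtree, or ('other', None)."""
    if cstmt[0] in ('funcdef', 'classdef'):
        # In these abbreviated tuples the second child is the keyword and
        # the third is the NAME token carrying the defined name.
        return cstmt[0], cstmt[2][1]
    return 'other', None


def collect(compound_statements):
    """Sort definition subtrees into function and class dictionaries."""
    functions, classes = {}, {}
    for cstmt in compound_statements:
        kind, name = classify(cstmt)
        if kind == 'funcdef':
            functions[name] = cstmt
        elif kind == 'classdef':
            classes[name] = cstmt
    return functions, classes


statements = [
    ('funcdef', ('NAME', 'def'), ('NAME', 'raiser')),
    ('classdef', ('NAME', 'class'), ('NAME', 'Shape')),
    ('if_stmt', ('NAME', 'if')),
]
functions, classes = collect(statements)
```
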
The public classes provide any accessors required which are more specific than
those provided by the :class:`SuiteInfoBase` class, but the real extraction
algorithm remains common to all forms of code blocks. A high-level function can
be used to extract the complete set of information from a source file. (See
file :file:`example.py`.) ::

   def get_docs(fileName):
       import os
       import parser

       source = open(fileName).read()
       basename = os.path.basename(os.path.splitext(fileName)[0])
       st = parser.suite(source)
       return ModuleInfo(st.totuple(), basename)

This provides an easy-to-use interface to the documentation of a module. If
information is required which is not extracted by the code of this example, the
code may be extended at clearly defined points to provide additional
capabilities.

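For comparison, roughly the same information can be gathered through the
:mod:`ast` module recommended in the note at the start of this module's
documentation. The sketch below is a swapped-in technique, not the
:mod:`parser`-based code of :file:`example.py`; it assumes Python 3, looks only
at top-level definitions, and the names ``module_docs`` and ``get_docs_ast``
are hypothetical.

```python
import ast
import os


def module_docs(source, name='<string>'):
    """Collect docstrings for a module and its top-level definitions."""
    tree = ast.parse(source)
    info = {
        'name': name,
        'docstring': ast.get_docstring(tree),
        'classes': {},
        'functions': {},
    }
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            info['classes'][node.name] = ast.get_docstring(node)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            info['functions'][node.name] = ast.get_docstring(node)
    return info


def get_docs_ast(file_name):
    """File-based wrapper with the same calling shape as get_docs()."""
    with open(file_name) as source_file:
        source = source_file.read()
    basename = os.path.basename(os.path.splitext(file_name)[0])
    return module_docs(source, basename)
```

Definitions without a docstring map to ``None``, so a caller can distinguish
"undocumented" from "not defined."
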