blob: 8fba6dfc8784777a65e613f7f9316c86dbf1d710 [file] [log] [blame]
Georg Brandl8ec7f652007-08-15 14:28:01 +00001
2:mod:`parser` --- Access Python parse trees
3===========================================
4
5.. module:: parser
6 :synopsis: Access parse trees for Python source code.
7.. moduleauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
8.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
9
10
Georg Brandlb19be572007-12-29 10:57:00 +000011.. Copyright 1995 Virginia Polytechnic Institute and State University and Fred
12 L. Drake, Jr. This copyright notice must be distributed on all copies, but
13 this document otherwise may be distributed as part of the Python
14 distribution. No fee may be charged for this document in any representation,
15 either on paper or electronically. This restriction does not affect other
16 elements in a distributed package in any way.
Georg Brandl8ec7f652007-08-15 14:28:01 +000017
18.. index:: single: parsing; Python source code
19
20The :mod:`parser` module provides an interface to Python's internal parser and
21byte-code compiler. The primary purpose for this interface is to allow Python
22code to edit the parse tree of a Python expression and create executable code
23from this. This is better than trying to parse and modify an arbitrary Python
24code fragment as a string because parsing is performed in a manner identical to
25the code forming the application. It is also faster.
26
Georg Brandl9cea5112008-06-07 18:17:37 +000027.. note::
28
29 From Python 2.5 onward, it's much more convenient to cut in at the Abstract
30 Syntax Tree (AST) generation and compilation stage, using the :mod:`ast`
31 module.
32
33 The :mod:`parser` module exports the names documented here also with "st"
34 replaced by "ast"; this is a legacy from the time when there was no other
35 AST and has nothing to do with the AST found in Python 2.5. This is also the
36 reason for the functions' keyword arguments being called *ast*, not *st*.
Georg Brandlf9efabb2008-07-23 15:16:45 +000037 The "ast" functions will be removed in Python 3.0.
Georg Brandl9cea5112008-06-07 18:17:37 +000038
Georg Brandl8ec7f652007-08-15 14:28:01 +000039There are a few things to note about this module which are important to making
40use of the data structures created. This is not a tutorial on editing the parse
41trees for Python code, but some examples of using the :mod:`parser` module are
42presented.
43
44Most importantly, a good understanding of the Python grammar processed by the
45internal parser is required. For full information on the language syntax, refer
46to :ref:`reference-index`. The parser
47itself is created from a grammar specification defined in the file
48:file:`Grammar/Grammar` in the standard Python distribution. The parse trees
Georg Brandl9cea5112008-06-07 18:17:37 +000049stored in the ST objects created by this module are the actual output from the
Georg Brandl8ec7f652007-08-15 14:28:01 +000050internal parser when created by the :func:`expr` or :func:`suite` functions,
Georg Brandl9cea5112008-06-07 18:17:37 +000051described below. The ST objects created by :func:`sequence2st` faithfully
Georg Brandl8ec7f652007-08-15 14:28:01 +000052simulate those structures. Be aware that the values of the sequences which are
53considered "correct" will vary from one version of Python to another as the
54formal grammar for the language is revised. However, transporting code from one
55Python version to another as source text will always allow correct parse trees
56to be created in the target version, with the only restriction being that
57migrating to an older version of the interpreter will not support more recent
58language constructs. The parse trees are not typically compatible from one
59version to another, whereas source code has always been forward-compatible.
60
Georg Brandl9cea5112008-06-07 18:17:37 +000061Each element of the sequences returned by :func:`st2list` or :func:`st2tuple`
Georg Brandl8ec7f652007-08-15 14:28:01 +000062has a simple form. Sequences representing non-terminal elements in the grammar
63always have a length greater than one. The first element is an integer which
64identifies a production in the grammar. These integers are given symbolic names
65in the C header file :file:`Include/graminit.h` and the Python module
66:mod:`symbol`. Each additional element of the sequence represents a component
67of the production as recognized in the input string: these are always sequences
68which have the same form as the parent. An important aspect of this structure
69which should be noted is that keywords used to identify the parent node type,
70such as the keyword :keyword:`if` in an :const:`if_stmt`, are included in the
71node tree without any special treatment. For example, the :keyword:`if` keyword
72is represented by the tuple ``(1, 'if')``, where ``1`` is the numeric value
73associated with all :const:`NAME` tokens, including variable and function names
74defined by the user. In an alternate form returned when line number information
75is requested, the same token might be represented as ``(1, 'if', 12)``, where
76the ``12`` represents the line number at which the terminal symbol was found.
77
78Terminal elements are represented in much the same way, but without any child
79elements and the addition of the source text which was identified. The example
80of the :keyword:`if` keyword above is representative. The various types of
81terminal symbols are defined in the C header file :file:`Include/token.h` and
82the Python module :mod:`token`.
83
Georg Brandl9cea5112008-06-07 18:17:37 +000084The ST objects are not required to support the functionality of this module,
Georg Brandl8ec7f652007-08-15 14:28:01 +000085but are provided for three purposes: to allow an application to amortize the
86cost of processing complex parse trees, to provide a parse tree representation
87which conserves memory space when compared to the Python list or tuple
88representation, and to ease the creation of additional modules in C which
89manipulate parse trees. A simple "wrapper" class may be created in Python to
Georg Brandl9cea5112008-06-07 18:17:37 +000090hide the use of ST objects.
Georg Brandl8ec7f652007-08-15 14:28:01 +000091
92The :mod:`parser` module defines functions for a few distinct purposes. The
Georg Brandl9cea5112008-06-07 18:17:37 +000093most important purposes are to create ST objects and to convert ST objects to
Georg Brandl8ec7f652007-08-15 14:28:01 +000094other representations such as parse trees and compiled code objects, but there
95are also functions which serve to query the type of parse tree represented by an
Georg Brandl9cea5112008-06-07 18:17:37 +000096ST object.
Georg Brandl8ec7f652007-08-15 14:28:01 +000097
98
99.. seealso::
100
101 Module :mod:`symbol`
102 Useful constants representing internal nodes of the parse tree.
103
104 Module :mod:`token`
105 Useful constants representing leaf nodes of the parse tree and functions for
106 testing node values.
107
108
Georg Brandl9cea5112008-06-07 18:17:37 +0000109.. _creating-sts:
Georg Brandl8ec7f652007-08-15 14:28:01 +0000110
Georg Brandl9cea5112008-06-07 18:17:37 +0000111Creating ST Objects
112-------------------
Georg Brandl8ec7f652007-08-15 14:28:01 +0000113
Georg Brandl9cea5112008-06-07 18:17:37 +0000114ST objects may be created from source code or from a parse tree. When creating
115an ST object from source, different functions are used to create the ``'eval'``
Georg Brandl8ec7f652007-08-15 14:28:01 +0000116and ``'exec'`` forms.
117
118
119.. function:: expr(source)
120
121 The :func:`expr` function parses the parameter *source* as if it were an input
Georg Brandl9cea5112008-06-07 18:17:37 +0000122 to ``compile(source, 'file.py', 'eval')``. If the parse succeeds, an ST object
Georg Brandl8ec7f652007-08-15 14:28:01 +0000123 is created to hold the internal parse tree representation, otherwise an
Georg Brandl21946af2010-10-06 09:28:45 +0000124 appropriate exception is raised.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000125
126
127.. function:: suite(source)
128
129 The :func:`suite` function parses the parameter *source* as if it were an input
Georg Brandl9cea5112008-06-07 18:17:37 +0000130 to ``compile(source, 'file.py', 'exec')``. If the parse succeeds, an ST object
Georg Brandl8ec7f652007-08-15 14:28:01 +0000131 is created to hold the internal parse tree representation, otherwise an
Georg Brandl21946af2010-10-06 09:28:45 +0000132 appropriate exception is raised.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000133
134
Georg Brandl9cea5112008-06-07 18:17:37 +0000135.. function:: sequence2st(sequence)
Georg Brandl8ec7f652007-08-15 14:28:01 +0000136
137 This function accepts a parse tree represented as a sequence and builds an
138 internal representation if possible. If it can validate that the tree conforms
139 to the Python grammar and all nodes are valid node types in the host version of
Georg Brandl9cea5112008-06-07 18:17:37 +0000140 Python, an ST object is created from the internal representation and returned
Georg Brandl8ec7f652007-08-15 14:28:01 +0000141 to the called. If there is a problem creating the internal representation, or
Georg Brandl21946af2010-10-06 09:28:45 +0000142 if the tree cannot be validated, a :exc:`ParserError` exception is raised. An
Georg Brandl9cea5112008-06-07 18:17:37 +0000143 ST object created this way should not be assumed to compile correctly; normal
Georg Brandl21946af2010-10-06 09:28:45 +0000144 exceptions raised by compilation may still be initiated when the ST object is
Georg Brandl9cea5112008-06-07 18:17:37 +0000145 passed to :func:`compilest`. This may indicate problems not related to syntax
Georg Brandl8ec7f652007-08-15 14:28:01 +0000146 (such as a :exc:`MemoryError` exception), but may also be due to constructs such
147 as the result of parsing ``del f(0)``, which escapes the Python parser but is
148 checked by the bytecode compiler.
149
150 Sequences representing terminal tokens may be represented as either two-element
151 lists of the form ``(1, 'name')`` or as three-element lists of the form ``(1,
152 'name', 56)``. If the third element is present, it is assumed to be a valid
153 line number. The line number may be specified for any subset of the terminal
154 symbols in the input tree.
155
156
Georg Brandl9cea5112008-06-07 18:17:37 +0000157.. function:: tuple2st(sequence)
Georg Brandl8ec7f652007-08-15 14:28:01 +0000158
Georg Brandl9cea5112008-06-07 18:17:37 +0000159 This is the same function as :func:`sequence2st`. This entry point is
Georg Brandl8ec7f652007-08-15 14:28:01 +0000160 maintained for backward compatibility.
161
162
Georg Brandl9cea5112008-06-07 18:17:37 +0000163.. _converting-sts:
Georg Brandl8ec7f652007-08-15 14:28:01 +0000164
Georg Brandl9cea5112008-06-07 18:17:37 +0000165Converting ST Objects
166---------------------
Georg Brandl8ec7f652007-08-15 14:28:01 +0000167
Georg Brandl9cea5112008-06-07 18:17:37 +0000168ST objects, regardless of the input used to create them, may be converted to
Georg Brandl8ec7f652007-08-15 14:28:01 +0000169parse trees represented as list- or tuple- trees, or may be compiled into
170executable code objects. Parse trees may be extracted with or without line
171numbering information.
172
173
Georg Brandl9cea5112008-06-07 18:17:37 +0000174.. function:: st2list(ast[, line_info])
Georg Brandl8ec7f652007-08-15 14:28:01 +0000175
Georg Brandl9cea5112008-06-07 18:17:37 +0000176 This function accepts an ST object from the caller in *ast* and returns a
Georg Brandl8ec7f652007-08-15 14:28:01 +0000177 Python list representing the equivalent parse tree. The resulting list
178 representation can be used for inspection or the creation of a new parse tree in
179 list form. This function does not fail so long as memory is available to build
180 the list representation. If the parse tree will only be used for inspection,
Georg Brandl9cea5112008-06-07 18:17:37 +0000181 :func:`st2tuple` should be used instead to reduce memory consumption and
Georg Brandl8ec7f652007-08-15 14:28:01 +0000182 fragmentation. When the list representation is required, this function is
183 significantly faster than retrieving a tuple representation and converting that
184 to nested lists.
185
186 If *line_info* is true, line number information will be included for all
187 terminal tokens as a third element of the list representing the token. Note
188 that the line number provided specifies the line on which the token *ends*.
189 This information is omitted if the flag is false or omitted.
190
191
Georg Brandl9cea5112008-06-07 18:17:37 +0000192.. function:: st2tuple(ast[, line_info])
Georg Brandl8ec7f652007-08-15 14:28:01 +0000193
Georg Brandl9cea5112008-06-07 18:17:37 +0000194 This function accepts an ST object from the caller in *ast* and returns a
Georg Brandl8ec7f652007-08-15 14:28:01 +0000195 Python tuple representing the equivalent parse tree. Other than returning a
Georg Brandl9cea5112008-06-07 18:17:37 +0000196 tuple instead of a list, this function is identical to :func:`st2list`.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000197
198 If *line_info* is true, line number information will be included for all
199 terminal tokens as a third element of the list representing the token. This
200 information is omitted if the flag is false or omitted.
201
202
Georg Brandl9cea5112008-06-07 18:17:37 +0000203.. function:: compilest(ast[, filename='<syntax-tree>'])
Georg Brandl8ec7f652007-08-15 14:28:01 +0000204
205 .. index:: builtin: eval
206
Georg Brandl9cea5112008-06-07 18:17:37 +0000207 The Python byte compiler can be invoked on an ST object to produce code objects
Georg Brandl8ec7f652007-08-15 14:28:01 +0000208 which can be used as part of an :keyword:`exec` statement or a call to the
209 built-in :func:`eval` function. This function provides the interface to the
210 compiler, passing the internal parse tree from *ast* to the parser, using the
211 source file name specified by the *filename* parameter. The default value
Georg Brandl9cea5112008-06-07 18:17:37 +0000212 supplied for *filename* indicates that the source was an ST object.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000213
Georg Brandl9cea5112008-06-07 18:17:37 +0000214 Compiling an ST object may result in exceptions related to compilation; an
Georg Brandl8ec7f652007-08-15 14:28:01 +0000215 example would be a :exc:`SyntaxError` caused by the parse tree for ``del f(0)``:
216 this statement is considered legal within the formal grammar for Python but is
217 not a legal language construct. The :exc:`SyntaxError` raised for this
218 condition is actually generated by the Python byte-compiler normally, which is
219 why it can be raised at this point by the :mod:`parser` module. Most causes of
220 compilation failure can be diagnosed programmatically by inspection of the parse
221 tree.
222
223
Georg Brandl9cea5112008-06-07 18:17:37 +0000224.. _querying-sts:
Georg Brandl8ec7f652007-08-15 14:28:01 +0000225
Georg Brandl9cea5112008-06-07 18:17:37 +0000226Queries on ST Objects
227---------------------
Georg Brandl8ec7f652007-08-15 14:28:01 +0000228
Georg Brandl9cea5112008-06-07 18:17:37 +0000229Two functions are provided which allow an application to determine if an ST was
Georg Brandl8ec7f652007-08-15 14:28:01 +0000230created as an expression or a suite. Neither of these functions can be used to
Georg Brandl9cea5112008-06-07 18:17:37 +0000231determine if an ST was created from source code via :func:`expr` or
232:func:`suite` or from a parse tree via :func:`sequence2st`.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000233
234
235.. function:: isexpr(ast)
236
237 .. index:: builtin: compile
238
239 When *ast* represents an ``'eval'`` form, this function returns true, otherwise
240 it returns false. This is useful, since code objects normally cannot be queried
241 for this information using existing built-in functions. Note that the code
Georg Brandl9cea5112008-06-07 18:17:37 +0000242 objects created by :func:`compilest` cannot be queried like this either, and
Georg Brandl8ec7f652007-08-15 14:28:01 +0000243 are identical to those created by the built-in :func:`compile` function.
244
245
246.. function:: issuite(ast)
247
Georg Brandl9cea5112008-06-07 18:17:37 +0000248 This function mirrors :func:`isexpr` in that it reports whether an ST object
Georg Brandl8ec7f652007-08-15 14:28:01 +0000249 represents an ``'exec'`` form, commonly known as a "suite." It is not safe to
250 assume that this function is equivalent to ``not isexpr(ast)``, as additional
251 syntactic fragments may be supported in the future.
252
253
Georg Brandl9cea5112008-06-07 18:17:37 +0000254.. _st-errors:
Georg Brandl8ec7f652007-08-15 14:28:01 +0000255
256Exceptions and Error Handling
257-----------------------------
258
259The parser module defines a single exception, but may also pass other built-in
260exceptions from other portions of the Python runtime environment. See each
261function for information about the exceptions it can raise.
262
263
264.. exception:: ParserError
265
266 Exception raised when a failure occurs within the parser module. This is
Georg Brandl21946af2010-10-06 09:28:45 +0000267 generally produced for validation failures rather than the built-in
268 :exc:`SyntaxError` raised during normal parsing. The exception argument is
Georg Brandl8ec7f652007-08-15 14:28:01 +0000269 either a string describing the reason of the failure or a tuple containing a
Georg Brandl9cea5112008-06-07 18:17:37 +0000270 sequence causing the failure from a parse tree passed to :func:`sequence2st`
271 and an explanatory string. Calls to :func:`sequence2st` need to be able to
Georg Brandl8ec7f652007-08-15 14:28:01 +0000272 handle either type of exception, while calls to other functions in the module
273 will only need to be aware of the simple string values.
274
Georg Brandl9cea5112008-06-07 18:17:37 +0000275Note that the functions :func:`compilest`, :func:`expr`, and :func:`suite` may
Georg Brandl21946af2010-10-06 09:28:45 +0000276raise exceptions which are normally thrown by the parsing and compilation
Georg Brandl8ec7f652007-08-15 14:28:01 +0000277process. These include the built in exceptions :exc:`MemoryError`,
278:exc:`OverflowError`, :exc:`SyntaxError`, and :exc:`SystemError`. In these
279cases, these exceptions carry all the meaning normally associated with them.
280Refer to the descriptions of each function for detailed information.
281
282
Georg Brandl9cea5112008-06-07 18:17:37 +0000283.. _st-objects:
Georg Brandl8ec7f652007-08-15 14:28:01 +0000284
Georg Brandl9cea5112008-06-07 18:17:37 +0000285ST Objects
286----------
Georg Brandl8ec7f652007-08-15 14:28:01 +0000287
Georg Brandl9cea5112008-06-07 18:17:37 +0000288Ordered and equality comparisons are supported between ST objects. Pickling of
289ST objects (using the :mod:`pickle` module) is also supported.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000290
291
Georg Brandl9cea5112008-06-07 18:17:37 +0000292.. data:: STType
Georg Brandl8ec7f652007-08-15 14:28:01 +0000293
294 The type of the objects returned by :func:`expr`, :func:`suite` and
Georg Brandl9cea5112008-06-07 18:17:37 +0000295 :func:`sequence2st`.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000296
Georg Brandl9cea5112008-06-07 18:17:37 +0000297ST objects have the following methods:
Georg Brandl8ec7f652007-08-15 14:28:01 +0000298
299
Georg Brandl9cea5112008-06-07 18:17:37 +0000300.. method:: ST.compile([filename])
Georg Brandl8ec7f652007-08-15 14:28:01 +0000301
Georg Brandl9cea5112008-06-07 18:17:37 +0000302 Same as ``compilest(st, filename)``.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000303
304
Georg Brandl9cea5112008-06-07 18:17:37 +0000305.. method:: ST.isexpr()
Georg Brandl8ec7f652007-08-15 14:28:01 +0000306
Georg Brandl9cea5112008-06-07 18:17:37 +0000307 Same as ``isexpr(st)``.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000308
309
Georg Brandl9cea5112008-06-07 18:17:37 +0000310.. method:: ST.issuite()
Georg Brandl8ec7f652007-08-15 14:28:01 +0000311
Georg Brandl9cea5112008-06-07 18:17:37 +0000312 Same as ``issuite(st)``.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000313
314
Georg Brandl9cea5112008-06-07 18:17:37 +0000315.. method:: ST.tolist([line_info])
Georg Brandl8ec7f652007-08-15 14:28:01 +0000316
Georg Brandl9cea5112008-06-07 18:17:37 +0000317 Same as ``st2list(st, line_info)``.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000318
319
Georg Brandl9cea5112008-06-07 18:17:37 +0000320.. method:: ST.totuple([line_info])
Georg Brandl8ec7f652007-08-15 14:28:01 +0000321
Georg Brandl9cea5112008-06-07 18:17:37 +0000322 Same as ``st2tuple(st, line_info)``.
Georg Brandl8ec7f652007-08-15 14:28:01 +0000323
324
Georg Brandl9cea5112008-06-07 18:17:37 +0000325.. _st-examples:
Georg Brandl8ec7f652007-08-15 14:28:01 +0000326
327Examples
328--------
329
330.. index:: builtin: compile
331
332The parser modules allows operations to be performed on the parse tree of Python
Georg Brandl63fa1682007-10-21 10:24:20 +0000333source code before the :term:`bytecode` is generated, and provides for inspection of the
Georg Brandl8ec7f652007-08-15 14:28:01 +0000334parse tree for information gathering purposes. Two examples are presented. The
335simple example demonstrates emulation of the :func:`compile` built-in function
336and the complex example shows the use of a parse tree for information discovery.
337
338
339Emulation of :func:`compile`
340^^^^^^^^^^^^^^^^^^^^^^^^^^^^
341
342While many useful operations may take place between parsing and bytecode
343generation, the simplest operation is to do nothing. For this purpose, using
344the :mod:`parser` module to produce an intermediate data structure is equivalent
345to the code ::
346
347 >>> code = compile('a + 5', 'file.py', 'eval')
348 >>> a = 5
349 >>> eval(code)
350 10
351
352The equivalent operation using the :mod:`parser` module is somewhat longer, and
Georg Brandl9cea5112008-06-07 18:17:37 +0000353allows the intermediate internal parse tree to be retained as an ST object::
Georg Brandl8ec7f652007-08-15 14:28:01 +0000354
355 >>> import parser
Georg Brandl9cea5112008-06-07 18:17:37 +0000356 >>> st = parser.expr('a + 5')
357 >>> code = st.compile('file.py')
Georg Brandl8ec7f652007-08-15 14:28:01 +0000358 >>> a = 5
359 >>> eval(code)
360 10
361
Georg Brandl9cea5112008-06-07 18:17:37 +0000362An application which needs both ST and code objects can package this code into
Georg Brandl8ec7f652007-08-15 14:28:01 +0000363readily available functions::
364
365 import parser
366
367 def load_suite(source_string):
Georg Brandl9cea5112008-06-07 18:17:37 +0000368 st = parser.suite(source_string)
369 return st, st.compile()
Georg Brandl8ec7f652007-08-15 14:28:01 +0000370
371 def load_expression(source_string):
Georg Brandl9cea5112008-06-07 18:17:37 +0000372 st = parser.expr(source_string)
373 return st, st.compile()
Georg Brandl8ec7f652007-08-15 14:28:01 +0000374
375
376Information Discovery
377^^^^^^^^^^^^^^^^^^^^^
378
379.. index::
380 single: string; documentation
381 single: docstrings
382
383Some applications benefit from direct access to the parse tree. The remainder
384of this section demonstrates how the parse tree provides access to module
385documentation defined in docstrings without requiring that the code being
386examined be loaded into a running interpreter via :keyword:`import`. This can
387be very useful for performing analyses of untrusted code.
388
389Generally, the example will demonstrate how the parse tree may be traversed to
390distill interesting information. Two functions and a set of classes are
391developed which provide programmatic access to high level function and class
392definitions provided by a module. The classes extract information from the
393parse tree and provide access to the information at a useful semantic level, one
394function provides a simple low-level pattern matching capability, and the other
395function defines a high-level interface to the classes by handling file
396operations on behalf of the caller. All source files mentioned here which are
397not part of the Python installation are located in the :file:`Demo/parser/`
398directory of the distribution.
399
400The dynamic nature of Python allows the programmer a great deal of flexibility,
401but most modules need only a limited measure of this when defining classes,
402functions, and methods. In this example, the only definitions that will be
403considered are those which are defined in the top level of their context, e.g.,
404a function defined by a :keyword:`def` statement at column zero of a module, but
405not a function defined within a branch of an :keyword:`if` ... :keyword:`else`
406construct, though there are some good reasons for doing so in some situations.
407Nesting of definitions will be handled by the code developed in the example.
408
409To construct the upper-level extraction methods, we need to know what the parse
410tree structure looks like and how much of it we actually need to be concerned
411about. Python uses a moderately deep parse tree so there are a large number of
412intermediate nodes. It is important to read and understand the formal grammar
413used by Python. This is specified in the file :file:`Grammar/Grammar` in the
414distribution. Consider the simplest case of interest when searching for
415docstrings: a module consisting of a docstring and nothing else. (See file
416:file:`docstring.py`.) ::
417
418 """Some documentation.
419 """
420
421Using the interpreter to take a look at the parse tree, we find a bewildering
422mass of numbers and parentheses, with the documentation buried deep in nested
423tuples. ::
424
425 >>> import parser
426 >>> import pprint
Georg Brandl9cea5112008-06-07 18:17:37 +0000427 >>> st = parser.suite(open('docstring.py').read())
428 >>> tup = st.totuple()
Georg Brandl8ec7f652007-08-15 14:28:01 +0000429 >>> pprint.pprint(tup)
430 (257,
431 (264,
432 (265,
433 (266,
434 (267,
435 (307,
436 (287,
437 (288,
438 (289,
439 (290,
440 (292,
441 (293,
442 (294,
443 (295,
444 (296,
445 (297,
446 (298,
447 (299,
448 (300, (3, '"""Some documentation.\n"""'))))))))))))))))),
449 (4, ''))),
450 (4, ''),
451 (0, ''))
452
453The numbers at the first element of each node in the tree are the node types;
454they map directly to terminal and non-terminal symbols in the grammar.
455Unfortunately, they are represented as integers in the internal representation,
456and the Python structures generated do not change that. However, the
457:mod:`symbol` and :mod:`token` modules provide symbolic names for the node types
458and dictionaries which map from the integers to the symbolic names for the node
459types.
460
461In the output presented above, the outermost tuple contains four elements: the
462integer ``257`` and three additional tuples. Node type ``257`` has the symbolic
463name :const:`file_input`. Each of these inner tuples contains an integer as the
464first element; these integers, ``264``, ``4``, and ``0``, represent the node
465types :const:`stmt`, :const:`NEWLINE`, and :const:`ENDMARKER`, respectively.
466Note that these values may change depending on the version of Python you are
467using; consult :file:`symbol.py` and :file:`token.py` for details of the
468mapping. It should be fairly clear that the outermost node is related primarily
469to the input source rather than the contents of the file, and may be disregarded
470for the moment. The :const:`stmt` node is much more interesting. In
471particular, all docstrings are found in subtrees which are formed exactly as
472this node is formed, with the only difference being the string itself. The
473association between the docstring in a similar tree and the defined entity
474(class, function, or module) which it describes is given by the position of the
475docstring subtree within the tree defining the described structure.
476
477By replacing the actual docstring with something to signify a variable component
478of the tree, we allow a simple pattern matching approach to check any given
479subtree for equivalence to the general pattern for docstrings. Since the
480example demonstrates information extraction, we can safely require that the tree
481be in tuple form rather than list form, allowing a simple variable
482representation to be ``['variable_name']``. A simple recursive function can
483implement the pattern matching, returning a Boolean and a dictionary of variable
484name to value mappings. (See file :file:`example.py`.) ::
485
486 from types import ListType, TupleType
487
488 def match(pattern, data, vars=None):
489 if vars is None:
490 vars = {}
491 if type(pattern) is ListType:
492 vars[pattern[0]] = data
493 return 1, vars
494 if type(pattern) is not TupleType:
495 return (pattern == data), vars
496 if len(data) != len(pattern):
497 return 0, vars
498 for pattern, data in map(None, pattern, data):
499 same, vars = match(pattern, data, vars)
500 if not same:
501 break
502 return same, vars
503
504Using this simple representation for syntactic variables and the symbolic node
505types, the pattern for the candidate docstring subtrees becomes fairly readable.
506(See file :file:`example.py`.) ::
507
508 import symbol
509 import token
510
511 DOCSTRING_STMT_PATTERN = (
512 symbol.stmt,
513 (symbol.simple_stmt,
514 (symbol.small_stmt,
515 (symbol.expr_stmt,
516 (symbol.testlist,
517 (symbol.test,
518 (symbol.and_test,
519 (symbol.not_test,
520 (symbol.comparison,
521 (symbol.expr,
522 (symbol.xor_expr,
523 (symbol.and_expr,
524 (symbol.shift_expr,
525 (symbol.arith_expr,
526 (symbol.term,
527 (symbol.factor,
528 (symbol.power,
529 (symbol.atom,
530 (token.STRING, ['docstring'])
531 )))))))))))))))),
532 (token.NEWLINE, '')
533 ))
534
535Using the :func:`match` function with this pattern, extracting the module
536docstring from the parse tree created previously is easy::
537
538 >>> found, vars = match(DOCSTRING_STMT_PATTERN, tup[1])
539 >>> found
540 1
541 >>> vars
542 {'docstring': '"""Some documentation.\n"""'}
543
544Once specific data can be extracted from a location where it is expected, the
545question of where information can be expected needs to be answered. When
546dealing with docstrings, the answer is fairly simple: the docstring is the first
547:const:`stmt` node in a code block (:const:`file_input` or :const:`suite` node
548types). A module consists of a single :const:`file_input` node, and class and
549function definitions each contain exactly one :const:`suite` node. Classes and
550functions are readily identified as subtrees of code block nodes which start
551with ``(stmt, (compound_stmt, (classdef, ...`` or ``(stmt, (compound_stmt,
552(funcdef, ...``. Note that these subtrees cannot be matched by :func:`match`
553since it does not support multiple sibling nodes to match without regard to
554number. A more elaborate matching function could be used to overcome this
555limitation, but this is sufficient for the example.
556
557Given the ability to determine whether a statement might be a docstring and
558extract the actual string from the statement, some work needs to be performed to
559walk the parse tree for an entire module and extract information about the names
560defined in each context of the module and associate any docstrings with the
561names. The code to perform this work is not complicated, but bears some
562explanation.
563
564The public interface to the classes is straightforward and should probably be
565somewhat more flexible. Each "major" block of the module is described by an
566object providing several methods for inquiry and a constructor which accepts at
567least the subtree of the complete parse tree which it represents. The
568:class:`ModuleInfo` constructor accepts an optional *name* parameter since it
569cannot otherwise determine the name of the module.
570
571The public classes include :class:`ClassInfo`, :class:`FunctionInfo`, and
572:class:`ModuleInfo`. All objects provide the methods :meth:`get_name`,
573:meth:`get_docstring`, :meth:`get_class_names`, and :meth:`get_class_info`. The
574:class:`ClassInfo` objects support :meth:`get_method_names` and
575:meth:`get_method_info` while the other classes provide
576:meth:`get_function_names` and :meth:`get_function_info`.
577
578Within each of the forms of code block that the public classes represent, most
579of the required information is in the same form and is accessed in the same way,
580with classes having the distinction that functions defined at the top level are
581referred to as "methods." Since the difference in nomenclature reflects a real
582semantic distinction from functions defined outside of a class, the
583implementation needs to maintain the distinction. Hence, most of the
584functionality of the public classes can be implemented in a common base class,
585:class:`SuiteInfoBase`, with the accessors for function and method information
586provided elsewhere. Note that there is only one class which represents function
587and method information; this parallels the use of the :keyword:`def` statement
588to define both types of elements.
589
590Most of the accessor functions are declared in :class:`SuiteInfoBase` and do not
591need to be overridden by subclasses. More importantly, the extraction of most
592information from a parse tree is handled through a method called by the
593:class:`SuiteInfoBase` constructor. The example code for most of the classes is
594clear when read alongside the formal grammar, but the method which recursively
595creates new information objects requires further examination. Here is the
596relevant part of the :class:`SuiteInfoBase` definition from :file:`example.py`::
597
598 class SuiteInfoBase:
599 _docstring = ''
600 _name = ''
601
602 def __init__(self, tree = None):
603 self._class_info = {}
604 self._function_info = {}
605 if tree:
606 self._extract_info(tree)
607
608 def _extract_info(self, tree):
609 # extract docstring
610 if len(tree) == 2:
611 found, vars = match(DOCSTRING_STMT_PATTERN[1], tree[1])
612 else:
613 found, vars = match(DOCSTRING_STMT_PATTERN, tree[3])
614 if found:
615 self._docstring = eval(vars['docstring'])
616 # discover inner definitions
617 for node in tree[1:]:
618 found, vars = match(COMPOUND_STMT_PATTERN, node)
619 if found:
620 cstmt = vars['compound']
621 if cstmt[0] == symbol.funcdef:
622 name = cstmt[2][1]
623 self._function_info[name] = FunctionInfo(cstmt)
624 elif cstmt[0] == symbol.classdef:
625 name = cstmt[2][1]
626 self._class_info[name] = ClassInfo(cstmt)
627
628After initializing some internal state, the constructor calls the
629:meth:`_extract_info` method. This method performs the bulk of the information
630extraction which takes place in the entire example. The extraction has two
631distinct phases: the location of the docstring for the parse tree passed in, and
632the discovery of additional definitions within the code block represented by the
633parse tree.
634
635The initial :keyword:`if` test determines whether the nested suite is of the
636"short form" or the "long form." The short form is used when the code block is
637on the same line as the definition of the code block, as in ::
638
639 def square(x): "Square an argument."; return x ** 2
640
641while the long form uses an indented block and allows nested definitions::
642
643 def make_power(exp):
Georg Brandlfc29f272009-01-02 20:25:14 +0000644 "Make a function that raises an argument to the exponent `exp`."
Georg Brandl8ec7f652007-08-15 14:28:01 +0000645 def raiser(x, y=exp):
646 return x ** y
647 return raiser
648
649When the short form is used, the code block may contain a docstring as the
650first, and possibly only, :const:`small_stmt` element. The extraction of such a
651docstring is slightly different and requires only a portion of the complete
652pattern used in the more common case. As implemented, the docstring will only
653be found if there is only one :const:`small_stmt` node in the
654:const:`simple_stmt` node. Since most functions and methods which use the short
655form do not provide a docstring, this may be considered sufficient. The
656extraction of the docstring proceeds using the :func:`match` function as
657described above, and the value of the docstring is stored as an attribute of the
658:class:`SuiteInfoBase` object.
659
660After docstring extraction, a simple definition discovery algorithm operates on
661the :const:`stmt` nodes of the :const:`suite` node. The special case of the
662short form is not tested; since there are no :const:`stmt` nodes in the short
663form, the algorithm will silently skip the single :const:`simple_stmt` node and
664correctly not discover any nested definitions.
665
666Each statement in the code block is categorized as a class definition, function
667or method definition, or something else. For the definition statements, the
668name of the element defined is extracted and a representation object appropriate
669to the definition is created with the defining subtree passed as an argument to
670the constructor. The representation objects are stored in instance variables
671and may be retrieved by name using the appropriate accessor methods.
672
673The public classes provide any accessors required which are more specific than
674those provided by the :class:`SuiteInfoBase` class, but the real extraction
675algorithm remains common to all forms of code blocks. A high-level function can
676be used to extract the complete set of information from a source file. (See
677file :file:`example.py`.) ::
678
679 def get_docs(fileName):
680 import os
681 import parser
682
683 source = open(fileName).read()
684 basename = os.path.basename(os.path.splitext(fileName)[0])
Georg Brandl9cea5112008-06-07 18:17:37 +0000685 st = parser.suite(source)
686 return ModuleInfo(st.totuple(), basename)
Georg Brandl8ec7f652007-08-15 14:28:01 +0000687
688This provides an easy-to-use interface to the documentation of a module. If
689information is required which is not extracted by the code of this example, the
690code may be extended at clearly defined points to provide additional
691capabilities.
692