:mod:`parser` --- Access Python parse trees
===========================================

.. module:: parser
   :synopsis: Access parse trees for Python source code.
.. moduleauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>


.. Copyright 1995 Virginia Polytechnic Institute and State University and Fred
   L. Drake, Jr. This copyright notice must be distributed on all copies, but
   this document otherwise may be distributed as part of the Python
   distribution. No fee may be charged for this document in any representation,
   either on paper or electronically. This restriction does not affect other
   elements in a distributed package in any way.

.. index:: single: parsing; Python source code

The :mod:`parser` module provides an interface to Python's internal parser and
byte-code compiler. The primary purpose for this interface is to allow Python
code to edit the parse tree of a Python expression and create executable code
from this. This is better than trying to parse and modify an arbitrary Python
code fragment as a string because parsing is performed in a manner identical to
the code forming the application. It is also faster.

.. note::

   From Python 2.5 onward, it's much more convenient to cut in at the Abstract
   Syntax Tree (AST) generation and compilation stage, using the :mod:`ast`
   module.

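As a rough illustration of the route the note recommends, the following sketch
uses the :mod:`ast` module, not :mod:`parser`, to parse an expression, edit its
tree, and compile the result. The helper name ``compile_doubled`` and the file
name ``'file.py'`` are made up for this example.

```python
import ast

def compile_doubled(source, filename='file.py'):
    """Parse *source* as an expression and compile ``(<source>) * 2``."""
    tree = ast.parse(source, filename=filename, mode='eval')
    # Wrap the parsed expression in a multiplication by the constant 2.
    tree.body = ast.BinOp(left=tree.body, op=ast.Mult(),
                          right=ast.Constant(value=2))
    # Newly created nodes carry no line or column information yet.
    ast.fix_missing_locations(tree)
    return compile(tree, filename, 'eval')

# Evaluate with an explicit namespace so the example does not rely on globals.
result = eval(compile_doubled('a + 5'), {'a': 5})
print(result)
```

The edit happens on tree nodes and never on the source text, which is the same
idea the :mod:`parser` module supports at the lower level of concrete parse
trees.
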
There are a few things to note about this module which are important to making
use of the data structures created. This is not a tutorial on editing the parse
trees for Python code, but some examples of using the :mod:`parser` module are
presented.

Most importantly, a good understanding of the Python grammar processed by the
internal parser is required. For full information on the language syntax, refer
to :ref:`reference-index`. The parser
itself is created from a grammar specification defined in the file
:file:`Grammar/Grammar` in the standard Python distribution. The parse trees
stored in the ST objects created by this module are the actual output from the
internal parser when created by the :func:`expr` or :func:`suite` functions,
described below. The ST objects created by :func:`sequence2st` faithfully
simulate those structures. Be aware that the values of the sequences which are
considered "correct" will vary from one version of Python to another as the
formal grammar for the language is revised. However, transporting code from one
Python version to another as source text will always allow correct parse trees
to be created in the target version, with the only restriction being that
migrating to an older version of the interpreter will not support more recent
language constructs. The parse trees are not typically compatible from one
version to another, whereas source code has always been forward-compatible.

Each element of the sequences returned by :func:`st2list` or :func:`st2tuple`
has a simple form. Sequences representing non-terminal elements in the grammar
always have a length greater than one. The first element is an integer which
identifies a production in the grammar. These integers are given symbolic names
in the C header file :file:`Include/graminit.h` and the Python module
:mod:`symbol`. Each additional element of the sequence represents a component
of the production as recognized in the input string: these are always sequences
which have the same form as the parent. An important aspect of this structure
which should be noted is that keywords used to identify the parent node type,
such as the keyword :keyword:`if` in an :const:`if_stmt`, are included in the
node tree without any special treatment. For example, the :keyword:`if` keyword
is represented by the tuple ``(1, 'if')``, where ``1`` is the numeric value
associated with all :const:`NAME` tokens, including variable and function names
defined by the user. In an alternate form returned when line number information
is requested, the same token might be represented as ``(1, 'if', 12)``, where
the ``12`` represents the line number at which the terminal symbol was found.

Terminal elements are represented in much the same way, but without any child
elements and the addition of the source text which was identified. The example
of the :keyword:`if` keyword above is representative. The various types of
terminal symbols are defined in the C header file :file:`Include/token.h` and
the Python module :mod:`token`.

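The node layout described above can be explored without the :mod:`parser`
module. The sketch below walks a hand-written tuple tree and reports its
terminal elements using only the :mod:`token` module. The helper name
``iter_terminals`` is hypothetical, and the tree is not a valid parse tree: the
value ``300`` is an arbitrary stand-in for a non-terminal node type.

```python
import token

def iter_terminals(node):
    """Yield (token_name, text, line_or_None) for each terminal in a tuple tree."""
    if token.ISTERMINAL(node[0]):
        # Terminal: (type, text) or (type, text, line_number).
        lineno = node[2] if len(node) > 2 else None
        yield token.tok_name[node[0]], node[1], lineno
    else:
        # Non-terminal: (type, child, child, ...); every child is a node.
        for child in node[1:]:
            yield from iter_terminals(child)

# Illustrative tree only; 300 stands in for some non-terminal symbol.
tree = (300, (1, 'if', 12), (300, (1, 'x', 12)), (4, '', 12), (0, ''))
terminals = list(iter_terminals(tree))
print(terminals)
```

Terminals and non-terminals are told apart purely by the integer in the first
position, so keywords such as ``if`` appear as ordinary :const:`NAME` tokens.
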
The ST objects are not required to support the functionality of this module,
but are provided for three purposes: to allow an application to amortize the
cost of processing complex parse trees, to provide a parse tree representation
which conserves memory space when compared to the Python list or tuple
representation, and to ease the creation of additional modules in C which
manipulate parse trees. A simple "wrapper" class may be created in Python to
hide the use of ST objects.

The :mod:`parser` module defines functions for a few distinct purposes. The
most important purposes are to create ST objects and to convert ST objects to
other representations such as parse trees and compiled code objects, but there
are also functions which serve to query the type of parse tree represented by an
ST object.


.. seealso::

   Module :mod:`symbol`
      Useful constants representing internal nodes of the parse tree.

   Module :mod:`token`
      Useful constants representing leaf nodes of the parse tree and functions for
      testing node values.


.. _creating-sts:

Creating ST Objects
-------------------

ST objects may be created from source code or from a parse tree. When creating
an ST object from source, different functions are used to create the ``'eval'``
and ``'exec'`` forms.


.. function:: expr(source)

   The :func:`expr` function parses the parameter *source* as if it were an input
   to ``compile(source, 'file.py', 'eval')``. If the parse succeeds, an ST object
   is created to hold the internal parse tree representation, otherwise an
   appropriate exception is raised.


.. function:: suite(source)

   The :func:`suite` function parses the parameter *source* as if it were an input
   to ``compile(source, 'file.py', 'exec')``. If the parse succeeds, an ST object
   is created to hold the internal parse tree representation, otherwise an
   appropriate exception is raised.


.. function:: sequence2st(sequence)

   This function accepts a parse tree represented as a sequence and builds an
   internal representation if possible. If it can validate that the tree conforms
   to the Python grammar and all nodes are valid node types in the host version of
   Python, an ST object is created from the internal representation and returned
   to the caller. If there is a problem creating the internal representation, or
   if the tree cannot be validated, a :exc:`ParserError` exception is raised. An
   ST object created this way should not be assumed to compile correctly; normal
   exceptions raised by compilation may still be initiated when the ST object is
   passed to :func:`compilest`. This may indicate problems not related to syntax
   (such as a :exc:`MemoryError` exception), but may also be due to constructs such
   as the result of parsing ``del f(0)``, which escapes the Python parser but is
   checked by the bytecode compiler.

   Sequences representing terminal tokens may be represented as either two-element
   sequences of the form ``(1, 'name')`` or as three-element sequences of the form
   ``(1, 'name', 56)``. If the third element is present, it is assumed to be a valid
   line number. The line number may be specified for any subset of the terminal
   symbols in the input tree.


.. function:: tuple2st(sequence)

   This is the same function as :func:`sequence2st`. This entry point is
   maintained for backward compatibility.


.. _converting-sts:

Converting ST Objects
---------------------

ST objects, regardless of the input used to create them, may be converted to
parse trees represented as list or tuple trees, or may be compiled into
executable code objects. Parse trees may be extracted with or without line
numbering information.


.. function:: st2list(st, line_info=False, col_info=False)

   This function accepts an ST object from the caller in *st* and returns a
   Python list representing the equivalent parse tree. The resulting list
   representation can be used for inspection or the creation of a new parse tree in
   list form. This function does not fail so long as memory is available to build
   the list representation. If the parse tree will only be used for inspection,
   :func:`st2tuple` should be used instead to reduce memory consumption and
   fragmentation. When the list representation is required, this function is
   significantly faster than retrieving a tuple representation and converting that
   to nested lists.

   If *line_info* is true, line number information will be included for all
   terminal tokens as a third element of the list representing the token. Note
   that the line number provided specifies the line on which the token *ends*.
   This information is omitted if the flag is false or omitted.


.. function:: st2tuple(st, line_info=False, col_info=False)

   This function accepts an ST object from the caller in *st* and returns a
   Python tuple representing the equivalent parse tree. Other than returning a
   tuple instead of a list, this function is identical to :func:`st2list`.

   If *line_info* is true, line number information will be included for all
   terminal tokens as a third element of the tuple representing the token. This
   information is omitted if the flag is false or omitted.


.. function:: compilest(st, filename='<syntax-tree>')

   .. index::
      builtin: exec
      builtin: eval

   The Python byte compiler can be invoked on an ST object to produce code objects
   which can be used as part of a call to the built-in :func:`exec` or :func:`eval`
   functions. This function provides the interface to the compiler, passing the
   internal parse tree from *st* to the compiler, using the source file name
   specified by the *filename* parameter. The default value supplied for *filename*
   indicates that the source was an ST object.

   Compiling an ST object may result in exceptions related to compilation; an
   example would be a :exc:`SyntaxError` caused by the parse tree for ``del f(0)``:
   this statement is considered legal within the formal grammar for Python but is
   not a legal language construct. The :exc:`SyntaxError` raised for this
   condition is actually generated by the Python byte-compiler normally, which is
   why it can be raised at this point by the :mod:`parser` module. Most causes of
   compilation failure can be diagnosed programmatically by inspection of the parse
   tree.


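The ``del f(0)`` case can be observed with the built-in :func:`compile` alone,
since the text above notes that the same :exc:`SyntaxError` is raised during
normal compilation. The following is only a sketch: the helper name
``try_compile`` and the file name ``'<example>'`` are invented here, and which
stage of the interpreter detects the problem may differ between Python versions.

```python
def try_compile(source, filename='<example>'):
    """Return (code, None) on success, or (None, message) on SyntaxError."""
    try:
        return compile(source, filename, 'exec'), None
    except SyntaxError as exc:
        # exc.msg holds the human-readable reason; its wording is version-specific.
        return None, exc.msg

code, error = try_compile('del f(0)')    # rejected: not a legal construct
ok_code, ok_error = try_compile('x = 1')  # accepted
print(error)
```

Code that calls :func:`compilest` on trees built by hand would presumably want
the same kind of ``try``/``except`` guard around the call.
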
.. _querying-sts:

Queries on ST Objects
---------------------

Two functions are provided which allow an application to determine if an ST was
created as an expression or a suite. Neither of these functions can be used to
determine if an ST was created from source code via :func:`expr` or
:func:`suite` or from a parse tree via :func:`sequence2st`.


.. function:: isexpr(st)

   .. index:: builtin: compile

   When *st* represents an ``'eval'`` form, this function returns true, otherwise
   it returns false. This is useful, since code objects normally cannot be queried
   for this information using existing built-in functions. Note that the code
   objects created by :func:`compilest` cannot be queried like this either, and
   are identical to those created by the built-in :func:`compile` function.


.. function:: issuite(st)

   This function mirrors :func:`isexpr` in that it reports whether an ST object
   represents an ``'exec'`` form, commonly known as a "suite." It is not safe to
   assume that this function is equivalent to ``not isexpr(st)``, as additional
   syntactic fragments may be supported in the future.


.. _st-errors:

Exceptions and Error Handling
-----------------------------

The parser module defines a single exception, but may also pass other built-in
exceptions from other portions of the Python runtime environment. See each
function for information about the exceptions it can raise.


.. exception:: ParserError

   Exception raised when a failure occurs within the parser module. This is
   generally produced for validation failures rather than the built-in
   :exc:`SyntaxError` raised during normal parsing. The exception argument is
   either a string describing the reason for the failure or a tuple containing a
   sequence causing the failure from a parse tree passed to :func:`sequence2st`
   and an explanatory string. Calls to :func:`sequence2st` need to be able to
   handle either type of exception argument, while calls to other functions in
   the module will only need to be aware of the simple string values.

Note that the functions :func:`compilest`, :func:`expr`, and :func:`suite` may
raise exceptions which are normally raised by the parsing and compilation
process. These include the built-in exceptions :exc:`MemoryError`,
:exc:`OverflowError`, :exc:`SyntaxError`, and :exc:`SystemError`. In these
cases, these exceptions carry all the meaning normally associated with them.
Refer to the descriptions of each function for detailed information.


.. _st-objects:

ST Objects
----------

Ordered and equality comparisons are supported between ST objects. Pickling of
ST objects (using the :mod:`pickle` module) is also supported.


.. data:: STType

   The type of the objects returned by :func:`expr`, :func:`suite` and
   :func:`sequence2st`.

ST objects have the following methods:


.. method:: ST.compile(filename='<syntax-tree>')

   Same as ``compilest(st, filename)``.


.. method:: ST.isexpr()

   Same as ``isexpr(st)``.


.. method:: ST.issuite()

   Same as ``issuite(st)``.


.. method:: ST.tolist(line_info=False, col_info=False)

   Same as ``st2list(st, line_info, col_info)``.


.. method:: ST.totuple(line_info=False, col_info=False)

   Same as ``st2tuple(st, line_info, col_info)``.


.. _st-examples:

Examples
--------

.. index:: builtin: compile

The :mod:`parser` module allows operations to be performed on the parse tree of
Python source code before the :term:`bytecode` is generated, and provides for
inspection of the parse tree for information gathering purposes. Two examples
are presented. The simple example demonstrates emulation of the :func:`compile`
built-in function and the complex example shows the use of a parse tree for
information discovery.


Emulation of :func:`compile`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While many useful operations may take place between parsing and bytecode
generation, the simplest operation is to do nothing. For this purpose, using
the :mod:`parser` module to produce an intermediate data structure is equivalent
to the code ::

   >>> code = compile('a + 5', 'file.py', 'eval')
   >>> a = 5
   >>> eval(code)
   10

The equivalent operation using the :mod:`parser` module is somewhat longer, and
allows the intermediate internal parse tree to be retained as an ST object::

   >>> import parser
   >>> st = parser.expr('a + 5')
   >>> code = st.compile('file.py')
   >>> a = 5
   >>> eval(code)
   10

An application which needs both ST and code objects can package this code into
readily available functions::

   import parser

   def load_suite(source_string):
       st = parser.suite(source_string)
       return st, st.compile()

   def load_expression(source_string):
       st = parser.expr(source_string)
       return st, st.compile()


Information Discovery
^^^^^^^^^^^^^^^^^^^^^

.. index::
   single: string; documentation
   single: docstrings

Some applications benefit from direct access to the parse tree. The remainder
of this section demonstrates how the parse tree provides access to module
documentation defined in docstrings without requiring that the code being
examined be loaded into a running interpreter via :keyword:`import`. This can
be very useful for performing analyses of untrusted code.

Generally, the example will demonstrate how the parse tree may be traversed to
distill interesting information. Two functions and a set of classes are
developed which provide programmatic access to high level function and class
definitions provided by a module. The classes extract information from the
parse tree and provide access to the information at a useful semantic level, one
function provides a simple low-level pattern matching capability, and the other
function defines a high-level interface to the classes by handling file
operations on behalf of the caller. All source files mentioned here which are
not part of the Python installation are located in the :file:`Demo/parser/`
directory of the distribution.

The dynamic nature of Python allows the programmer a great deal of flexibility,
but most modules need only a limited measure of this when defining classes,
functions, and methods. In this example, the only definitions that will be
considered are those which are defined in the top level of their context, e.g.,
a function defined by a :keyword:`def` statement at column zero of a module, but
not a function defined within a branch of an :keyword:`if` ... :keyword:`else`
construct, though there are some good reasons for doing so in some situations.
Nesting of definitions will be handled by the code developed in the example.

To construct the upper-level extraction methods, we need to know what the parse
tree structure looks like and how much of it we actually need to be concerned
about. Python uses a moderately deep parse tree so there are a large number of
intermediate nodes. It is important to read and understand the formal grammar
used by Python. This is specified in the file :file:`Grammar/Grammar` in the
distribution. Consider the simplest case of interest when searching for
docstrings: a module consisting of a docstring and nothing else. (See file
:file:`docstring.py`.) ::

   """Some documentation.
   """

Using the interpreter to take a look at the parse tree, we find a bewildering
mass of numbers and parentheses, with the documentation buried deep in nested
tuples. ::

   >>> import parser
   >>> import pprint
   >>> st = parser.suite(open('docstring.py').read())
   >>> tup = st.totuple()
   >>> pprint.pprint(tup)
   (257,
    (264,
     (265,
      (266,
       (267,
        (307,
         (287,
          (288,
           (289,
            (290,
             (292,
              (293,
               (294,
                (295,
                 (296,
                  (297,
                   (298,
                    (299,
                     (300, (3, '"""Some documentation.\n"""'))))))))))))))))),
      (4, ''))),
    (4, ''),
    (0, ''))

The numbers at the first element of each node in the tree are the node types;
they map directly to terminal and non-terminal symbols in the grammar.
Unfortunately, they are represented as integers in the internal representation,
and the Python structures generated do not change that. However, the
:mod:`symbol` and :mod:`token` modules provide symbolic names for the node types
and dictionaries which map from the integers to the symbolic names for the node
types.

In the output presented above, the outermost tuple contains four elements: the
integer ``257`` and three additional tuples. Node type ``257`` has the symbolic
name :const:`file_input`. Each of these inner tuples contains an integer as the
first element; these integers, ``264``, ``4``, and ``0``, represent the node
types :const:`stmt`, :const:`NEWLINE`, and :const:`ENDMARKER`, respectively.
Note that these values may change depending on the version of Python you are
using; consult :file:`symbol.py` and :file:`token.py` for details of the
mapping. It should be fairly clear that the outermost node is related primarily
to the input source rather than the contents of the file, and may be disregarded
for the moment. The :const:`stmt` node is much more interesting. In
particular, all docstrings are found in subtrees which are formed exactly as
this node is formed, with the only difference being the string itself. The
association between the docstring in a similar tree and the defined entity
(class, function, or module) which it describes is given by the position of the
docstring subtree within the tree defining the described structure.

By replacing the actual docstring with something to signify a variable component
of the tree, we allow a simple pattern matching approach to check any given
subtree for equivalence to the general pattern for docstrings. Since the
example demonstrates information extraction, we can safely require that the tree
be in tuple form rather than list form, allowing a simple variable
representation to be ``['variable_name']``. A simple recursive function can
implement the pattern matching, returning a Boolean and a dictionary of variable
name to value mappings. (See file :file:`example.py`.) ::

   def match(pattern, data, vars=None):
       # Return (matched, vars); vars maps pattern variable names to subtrees.
       if vars is None:
           vars = {}
       if isinstance(pattern, list):
           # A one-element list such as ['docstring'] is a pattern variable.
           vars[pattern[0]] = data
           return True, vars
       if not isinstance(pattern, tuple):
           return (pattern == data), vars
       if len(data) != len(pattern):
           return False, vars
       for pattern, data in zip(pattern, data):
           same, vars = match(pattern, data, vars)
           if not same:
               break
       return same, vars

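To see the matcher at work on something smaller than a full parse tree, the
sketch below repeats the function so that it runs on its own and applies it to
a hand-written tree. Two details are additions made for this sketch and are not
part of :file:`example.py`: the check that *data* is a tuple, and the initial
``same = True``. The node type ``300`` is an arbitrary stand-in for a
non-terminal symbol.

```python
def match(pattern, data, vars=None):
    """Return (matched, vars) for *data* compared against *pattern*."""
    if vars is None:
        vars = {}
    if isinstance(pattern, list):
        # ['name'] is a variable: bind whatever appears at this position.
        vars[pattern[0]] = data
        return True, vars
    if not isinstance(pattern, tuple):
        return (pattern == data), vars
    # Added guard: a tuple pattern can only match a tuple of equal length.
    if not isinstance(data, tuple) or len(data) != len(pattern):
        return False, vars
    same = True  # added so an empty tuple pattern has a defined result
    for sub_pattern, sub_data in zip(pattern, data):
        same, vars = match(sub_pattern, sub_data, vars)
        if not same:
            break
    return same, vars

NAME, NEWLINE = 1, 4  # token numbers quoted in the text above
pattern = (300, (NAME, ['first']), (NEWLINE, ''))
hit = match(pattern, (300, (NAME, 'spam'), (NEWLINE, '')))
miss = match(pattern, (300, (NAME, 'spam')))
print(hit, miss)
```

A successful match returns the bound variables, while a tree with a different
number of children fails at the length check before any variable is bound.
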
Using this simple representation for syntactic variables and the symbolic node
types, the pattern for the candidate docstring subtrees becomes fairly readable.
(See file :file:`example.py`.) ::

   import symbol
   import token

   DOCSTRING_STMT_PATTERN = (
       symbol.stmt,
       (symbol.simple_stmt,
        (symbol.small_stmt,
         (symbol.expr_stmt,
          (symbol.testlist,
           (symbol.test,
            (symbol.and_test,
             (symbol.not_test,
              (symbol.comparison,
               (symbol.expr,
                (symbol.xor_expr,
                 (symbol.and_expr,
                  (symbol.shift_expr,
                   (symbol.arith_expr,
                    (symbol.term,
                     (symbol.factor,
                      (symbol.power,
                       (symbol.atom,
                        (token.STRING, ['docstring'])
                        )))))))))))))))),
        (token.NEWLINE, '')
        ))

Using the :func:`match` function with this pattern, extracting the module
docstring from the parse tree created previously is easy::

   >>> found, vars = match(DOCSTRING_STMT_PATTERN, tup[1])
   >>> found
   True
   >>> vars
   {'docstring': '"""Some documentation.\n"""'}

Once specific data can be extracted from a location where it is expected, the
question of where information can be expected needs to be answered. When
dealing with docstrings, the answer is fairly simple: the docstring is the first
:const:`stmt` node in a code block (:const:`file_input` or :const:`suite` node
types). A module consists of a single :const:`file_input` node, and class and
function definitions each contain exactly one :const:`suite` node. Classes and
functions are readily identified as subtrees of code block nodes which start
with ``(stmt, (compound_stmt, (classdef, ...`` or ``(stmt, (compound_stmt,
(funcdef, ...``. Note that these subtrees cannot be matched by :func:`match`
since it does not support multiple sibling nodes to match without regard to
number. A more elaborate matching function could be used to overcome this
limitation, but this is sufficient for the example.

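For comparison, the same rule (the docstring is the first statement of a code
block, and definitions are found directly inside code blocks) can be sketched
with the :mod:`ast` module, which does the low-level matching itself. This is
not the approach of :file:`example.py`; the function name
``collect_docstrings`` and the sample source are invented for illustration.

```python
import ast

def collect_docstrings(source, name='<module>'):
    """Map dotted names of the module, its classes and functions to docstrings."""
    found = {}

    def visit(node, qualname):
        # ast.get_docstring returns None when the block has no docstring.
        found[qualname] = ast.get_docstring(node)
        # Only definitions directly in this block's body are considered.
        for child in node.body:
            if isinstance(child, (ast.ClassDef, ast.FunctionDef,
                                  ast.AsyncFunctionDef)):
                visit(child, qualname + '.' + child.name)

    visit(ast.parse(source), name)
    return found

SOURCE = '''"""Module doc."""

class Spam:
    """Class doc."""
    def eggs(self):
        """Method doc."""

def ham():
    pass
'''
docs = collect_docstrings(SOURCE, name='sample')
print(docs)
```

As in the parse tree example, the source is only parsed and never imported or
executed, and definitions nested inside ``if`` branches are ignored.
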
549
550Given the ability to determine whether a statement might be a docstring and
551extract the actual string from the statement, some work needs to be performed to
552walk the parse tree for an entire module and extract information about the names
553defined in each context of the module and associate any docstrings with the
554names. The code to perform this work is not complicated, but bears some
555explanation.
556
The public interface to the classes is straightforward and should probably be
somewhat more flexible. Each "major" block of the module is described by an
object providing several methods for inquiry and a constructor which accepts at
least the subtree of the complete parse tree which it represents. The
:class:`ModuleInfo` constructor accepts an optional *name* parameter since it
cannot otherwise determine the name of the module.

564The public classes include :class:`ClassInfo`, :class:`FunctionInfo`, and
565:class:`ModuleInfo`. All objects provide the methods :meth:`get_name`,
566:meth:`get_docstring`, :meth:`get_class_names`, and :meth:`get_class_info`. The
567:class:`ClassInfo` objects support :meth:`get_method_names` and
568:meth:`get_method_info` while the other classes provide
569:meth:`get_function_names` and :meth:`get_function_info`.

Within each of the forms of code block that the public classes represent, most
of the required information is in the same form and is accessed in the same way,
except that functions defined at the top level of a class are referred to as
"methods." Since the difference in nomenclature reflects a real semantic
distinction from functions defined outside of a class, the implementation needs
to maintain the distinction. Hence, most of the functionality of the public
classes can be implemented in a common base class, :class:`SuiteInfoBase`, with
the accessors for function and method information provided elsewhere. Note that
there is only one class which represents function and method information; this
parallels the use of the :keyword:`def` statement to define both types of
elements.
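
One way this split could be arranged is sketched below. This is an assumption
about layout, not the code in :file:`example.py`: the mixin names
``FunctionAccessors`` and ``MethodAccessors`` are hypothetical, and parse-tree
extraction is left out so that the sketch runs on its own. The shared accessors
live in the base class, and the two mixins expose the same underlying dictionary
under "function" or "method" names.

```python
class SuiteInfoBase:
    """Shared state and accessors; parse-tree extraction is omitted here."""
    _docstring = ''
    _name = ''

    def __init__(self):
        self._class_info = {}
        self._function_info = {}

    def get_name(self):
        return self._name

    def get_docstring(self):
        return self._docstring

    def get_class_names(self):
        return list(self._class_info)

    def get_class_info(self, name):
        return self._class_info[name]


class FunctionAccessors:
    """Hypothetical mixin: exposes nested definitions as functions."""

    def get_function_names(self):
        return list(self._function_info)

    def get_function_info(self, name):
        return self._function_info[name]


class MethodAccessors:
    """Hypothetical mixin: exposes the same data under method names."""

    def get_method_names(self):
        return list(self._function_info)

    def get_method_info(self, name):
        return self._function_info[name]


class FunctionInfo(SuiteInfoBase, FunctionAccessors):
    pass


class ModuleInfo(SuiteInfoBase, FunctionAccessors):
    pass


class ClassInfo(SuiteInfoBase, MethodAccessors):
    pass


# Populate by hand, since no parse tree is involved in this sketch.
shape = ClassInfo()
shape._function_info['area'] = FunctionInfo()
method_names = shape.get_method_names()
```

Note that the object stored for a method is still a :class:`FunctionInfo`; only
the accessor names on the enclosing :class:`ClassInfo` differ.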

Most of the accessor functions are declared in :class:`SuiteInfoBase` and do not
need to be overridden by subclasses. More importantly, the extraction of most
information from a parse tree is handled through a method called by the
:class:`SuiteInfoBase` constructor. The example code for most of the classes is
clear when read alongside the formal grammar, but the method which recursively
creates new information objects requires further examination. Here is the
relevant part of the :class:`SuiteInfoBase` definition from :file:`example.py`::

   class SuiteInfoBase:
       _docstring = ''
       _name = ''

       def __init__(self, tree=None):
           self._class_info = {}
           self._function_info = {}
           if tree:
               self._extract_info(tree)

       def _extract_info(self, tree):
           # extract docstring
           if len(tree) == 2:
               # short form: the suite holds a single simple_stmt
               found, vars = match(DOCSTRING_STMT_PATTERN[1], tree[1])
           else:
               # long form: tree[3] is the first stmt after NEWLINE and INDENT
               found, vars = match(DOCSTRING_STMT_PATTERN, tree[3])
           if found:
               # the captured value is the source text of a string literal
               self._docstring = eval(vars['docstring'])
           # discover inner definitions
           for node in tree[1:]:
               found, vars = match(COMPOUND_STMT_PATTERN, node)
               if found:
                   cstmt = vars['compound']
                   if cstmt[0] == symbol.funcdef:
                       name = cstmt[2][1]
                       self._function_info[name] = FunctionInfo(cstmt)
                   elif cstmt[0] == symbol.classdef:
                       name = cstmt[2][1]
                       self._class_info[name] = ClassInfo(cstmt)

After initializing some internal state, the constructor calls the
:meth:`_extract_info` method. This method performs the bulk of the information
extraction which takes place in the entire example. The extraction has two
distinct phases: the location of the docstring for the parse tree passed in, and
the discovery of additional definitions within the code block represented by the
parse tree.

The initial :keyword:`if` test determines whether the nested suite is of the
"short form" or the "long form." The short form is used when the code block is
on the same line as the definition of the code block, as in ::

   def square(x): "Square an argument."; return x ** 2

while the long form uses an indented block and allows nested definitions::

   def make_power(exp):
       "Make a function that raises an argument to the exponent `exp`."
       def raiser(x, y=exp):
           return x ** y
       return raiser

When the short form is used, the code block may contain a docstring as the
first, and possibly only, :const:`small_stmt` element. The extraction of such a
docstring is slightly different and requires only a portion of the complete
pattern used in the more common case. As implemented, the docstring will only
be found if there is only one :const:`small_stmt` node in the
:const:`simple_stmt` node. Since most functions and methods which use the short
form do not provide a docstring, this may be considered sufficient. The
extraction of the docstring proceeds using the :func:`match` function as
described above, and the value of the docstring is stored as an attribute of the
:class:`SuiteInfoBase` object.
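
The difference between the two suite shapes can be sketched with plain tuples.
The string node labels below are hypothetical stand-ins for the numeric
:mod:`symbol` and :mod:`token` constants, and ``docstring_candidate`` is a
helper invented for this illustration. A short-form suite has a single
:const:`simple_stmt` child, while a long-form suite begins with ``NEWLINE`` and
``INDENT`` tokens, so its first statement sits at index 3.

```python
# Hypothetical stand-ins for the symbol/token constants; real values are ints.
SUITE, SIMPLE_STMT, STMT = 'suite', 'simple_stmt', 'stmt'
NEWLINE, INDENT, DEDENT = 'NEWLINE', 'INDENT', 'DEDENT'


def docstring_candidate(suite):
    """Return the subtree that could hold the docstring."""
    if len(suite) == 2:
        return suite[1]   # short form: (suite, simple_stmt)
    return suite[3]       # long form: skip the NEWLINE and INDENT tokens


short_suite = (SUITE, (SIMPLE_STMT, '...'))
long_suite = (SUITE,
              (NEWLINE, ''),
              (INDENT, ''),
              (STMT, 'first'),
              (STMT, 'second'),
              (DEDENT, ''))

short_candidate = docstring_candidate(short_suite)
long_candidate = docstring_candidate(long_suite)
```

Because the short-form candidate is a :const:`simple_stmt` rather than a
:const:`stmt`, only the inner portion of the docstring pattern applies to it.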

After docstring extraction, a simple definition discovery algorithm operates on
the :const:`stmt` nodes of the :const:`suite` node. The special case of the
short form is not tested; since there are no :const:`stmt` nodes in the short
form, the algorithm will silently skip the single :const:`simple_stmt` node and
correctly not discover any nested definitions.

Each statement in the code block is categorized as a class definition, function
or method definition, or something else. For the definition statements, the
name of the element defined is extracted and a representation object appropriate
to the definition is created with the defining subtree passed as an argument to
the constructor. The representation objects are stored in instance variables
and may be retrieved by name using the appropriate accessor methods.

The public classes provide any accessors required which are more specific than
those provided by the :class:`SuiteInfoBase` class, but the real extraction
algorithm remains common to all forms of code blocks. A high-level function can
be used to extract the complete set of information from a source file. (See
file :file:`example.py`.) ::

   def get_docs(fileName):
       import os
       import parser

       # read the source, closing the file promptly
       with open(fileName) as f:
           source = f.read()
       basename = os.path.basename(os.path.splitext(fileName)[0])
       st = parser.suite(source)
       return ModuleInfo(st.totuple(), basename)

This provides an easy-to-use interface to the documentation of a module. If
information is required which is not extracted by the code of this example, the
code may be extended at clearly defined points to provide additional
capabilities.

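For comparison, the note near the top of this document recommends the
:mod:`ast` module for this kind of work. The following is a minimal sketch of a
similar docstring collector built on :mod:`ast`; it is not part of
:file:`example.py`, and the function name ``get_docs_ast`` and the flat
dictionary it returns are assumptions made for this illustration. Like the
parse-tree example, it only looks at definitions that appear directly in the
body of a module, class, or function.

```python
import ast


def get_docs_ast(source, name="<string>"):
    """Map dotted names to docstrings for a module and its nested definitions."""
    docs = {}
    definition_types = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)

    def visit(node, qualified_name):
        # ast.get_docstring returns None when no docstring is present.
        docs[qualified_name] = ast.get_docstring(node)
        for child in node.body:
            if isinstance(child, definition_types):
                visit(child, qualified_name + "." + child.name)

    visit(ast.parse(source), name)
    return docs


SOURCE = '''"""Some documentation."""

class Shape:
    "A shape."
    def area(self):
        "Return the area."

def square(x): "Square an argument."; return x ** 2
'''

docs = get_docs_ast(SOURCE, "demo")
```

With :mod:`ast`, the short and long forms of a suite need no separate handling,
since both produce an ordinary ``body`` list on the definition node.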