:mod:`parser` --- Access Python parse trees
===========================================

.. module:: parser
   :synopsis: Access parse trees for Python source code.
.. moduleauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>


.. Copyright 1995 Virginia Polytechnic Institute and State University and Fred
   L. Drake, Jr. This copyright notice must be distributed on all copies, but
   this document otherwise may be distributed as part of the Python
   distribution. No fee may be charged for this document in any representation,
   either on paper or electronically. This restriction does not affect other
   elements in a distributed package in any way.

.. index:: single: parsing; Python source code

The :mod:`parser` module provides an interface to Python's internal parser and
byte-code compiler. The primary purpose for this interface is to allow Python
code to edit the parse tree of a Python expression and create executable code
from this. This is better than trying to parse and modify an arbitrary Python
code fragment as a string because parsing is performed in a manner identical to
the code forming the application. It is also faster.

.. note::

   From Python 2.5 onward, it's much more convenient to cut in at the Abstract
   Syntax Tree (AST) generation and compilation stage, using the :mod:`ast`
   module.
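As a rough illustration of the :mod:`ast` route mentioned in the note, the
sketch below parses an expression, rewrites one constant, and compiles the
result. The helper name ``replace_constant`` and the sample expression are made
up for this example, and ``ast.Constant`` assumes Python 3.8 or later.

```python
import ast


def replace_constant(source, old, new):
    """Compile *source* as an expression with constant *old* swapped for *new*.

    Hypothetical helper for illustration only.
    """
    tree = ast.parse(source, mode="eval")
    for node in ast.walk(tree):
        # ast.Constant represents literal values on Python 3.8+.
        if isinstance(node, ast.Constant) and node.value == old:
            node.value = new
    return compile(tree, "<ast>", "eval")


code = replace_constant("a + 5", 5, 7)
result = eval(code, {"a": 5})
print(result)
```

The tree is edited in place before compilation, which is the same
parse, modify, compile sequence that this module supports at the parse-tree
level.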

There are a few things to note about this module which are important to making
use of the data structures created. This is not a tutorial on editing the parse
trees for Python code, but some examples of using the :mod:`parser` module are
presented.

Most importantly, a good understanding of the Python grammar processed by the
internal parser is required. For full information on the language syntax, refer
to :ref:`reference-index`. The parser
itself is created from a grammar specification defined in the file
:file:`Grammar/Grammar` in the standard Python distribution. The parse trees
stored in the ST objects created by this module are the actual output from the
internal parser when created by the :func:`expr` or :func:`suite` functions,
described below. The ST objects created by :func:`sequence2st` faithfully
simulate those structures. Be aware that the values of the sequences which are
considered "correct" will vary from one version of Python to another as the
formal grammar for the language is revised. However, transporting code from one
Python version to another as source text will always allow correct parse trees
to be created in the target version, with the only restriction being that
migrating to an older version of the interpreter will not support more recent
language constructs. The parse trees are not typically compatible from one
version to another, whereas source code has always been forward-compatible.

Each element of the sequences returned by :func:`st2list` or :func:`st2tuple`
has a simple form. Sequences representing non-terminal elements in the grammar
always have a length greater than one. The first element is an integer which
identifies a production in the grammar. These integers are given symbolic names
in the C header file :file:`Include/graminit.h` and the Python module
:mod:`symbol`. Each additional element of the sequence represents a component
of the production as recognized in the input string: these are always sequences
which have the same form as the parent. An important aspect of this structure
which should be noted is that keywords used to identify the parent node type,
such as the keyword :keyword:`if` in an :const:`if_stmt`, are included in the
node tree without any special treatment. For example, the :keyword:`if` keyword
is represented by the tuple ``(1, 'if')``, where ``1`` is the numeric value
associated with all :const:`NAME` tokens, including variable and function names
defined by the user. In an alternate form returned when line number information
is requested, the same token might be represented as ``(1, 'if', 12)``, where
the ``12`` represents the line number at which the terminal symbol was found.
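The shape described above can be explored without the interpreter's parser.
The sketch below walks a small hand-written tuple tree and collects its
terminal tokens. The helper name ``iter_terminals`` and the sample tree (with
made-up non-terminal numbers ``300`` and ``301``) are illustrative assumptions,
not output from :func:`st2tuple`.

```python
def iter_terminals(node):
    """Yield (token_type, text) for every terminal below *node*.

    A node is treated as terminal when its second element is a string,
    which also covers the three-element form carrying a line number.
    """
    if isinstance(node[1], str):
        yield node[0], node[1]
        return
    for child in node[1:]:
        yield from iter_terminals(child)


# Hand-written tree: 300 and 301 stand in for non-terminal node types.
sample = (300,
          (1, 'if', 12),
          (301, (1, 'x', 12)),
          (11, ':', 12))

terminals = list(iter_terminals(sample))
print(terminals)
```

Note that the keyword ``if`` and the user-defined name ``x`` both come back
with token type ``1``, as the text explains.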

Terminal elements are represented in much the same way, but without any child
elements and the addition of the source text which was identified. The example
of the :keyword:`if` keyword above is representative. The various types of
terminal symbols are defined in the C header file :file:`Include/token.h` and
the Python module :mod:`token`.

The ST objects are not required to support the functionality of this module,
but are provided for three purposes: to allow an application to amortize the
cost of processing complex parse trees, to provide a parse tree representation
which conserves memory space when compared to the Python list or tuple
representation, and to ease the creation of additional modules in C which
manipulate parse trees. A simple "wrapper" class may be created in Python to
hide the use of ST objects.

The :mod:`parser` module defines functions for a few distinct purposes. The
most important purposes are to create ST objects and to convert ST objects to
other representations such as parse trees and compiled code objects, but there
are also functions which serve to query the type of parse tree represented by an
ST object.


.. seealso::

   Module :mod:`symbol`
      Useful constants representing internal nodes of the parse tree.

   Module :mod:`token`
      Useful constants representing leaf nodes of the parse tree and functions for
      testing node values.


.. _creating-sts:

Creating ST Objects
-------------------

ST objects may be created from source code or from a parse tree. When creating
an ST object from source, different functions are used to create the ``'eval'``
and ``'exec'`` forms.


.. function:: expr(source)

   The :func:`expr` function parses the parameter *source* as if it were an input
   to ``compile(source, 'file.py', 'eval')``. If the parse succeeds, an ST object
   is created to hold the internal parse tree representation, otherwise an
   appropriate exception is raised.


.. function:: suite(source)

   The :func:`suite` function parses the parameter *source* as if it were an input
   to ``compile(source, 'file.py', 'exec')``. If the parse succeeds, an ST object
   is created to hold the internal parse tree representation, otherwise an
   appropriate exception is raised.


.. function:: sequence2st(sequence)

   This function accepts a parse tree represented as a sequence and builds an
   internal representation if possible. If it can validate that the tree conforms
   to the Python grammar and all nodes are valid node types in the host version of
   Python, an ST object is created from the internal representation and returned
   to the caller. If there is a problem creating the internal representation, or
   if the tree cannot be validated, a :exc:`ParserError` exception is raised. An
   ST object created this way should not be assumed to compile correctly; the
   usual exceptions raised by compilation may still occur when the ST object is
   passed to :func:`compilest`. This may indicate problems not related to syntax
   (such as a :exc:`MemoryError` exception), but may also be due to constructs such
   as the result of parsing ``del f(0)``, which escapes the Python parser but is
   checked by the bytecode compiler.

   Sequences representing terminal tokens may be represented as either two-element
   lists of the form ``(1, 'name')`` or as three-element lists of the form ``(1,
   'name', 56)``. If the third element is present, it is assumed to be a valid
   line number. The line number may be specified for any subset of the terminal
   symbols in the input tree.
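Because line numbers are optional on each terminal, trees from different
sources may mix the two- and three-element forms. A minimal sketch of
normalizing a tuple tree by dropping line numbers follows.
``strip_line_numbers`` is a hypothetical helper, not part of the :mod:`parser`
module.

```python
def strip_line_numbers(node):
    """Return a copy of the tuple tree *node* without line numbers."""
    if isinstance(node[1], str):
        # Terminal: keep only the token type and its source text.
        return node[0], node[1]
    # Non-terminal: keep the node type and recurse into each child.
    return (node[0],) + tuple(strip_line_numbers(child) for child in node[1:])


# 300 is a made-up non-terminal number used only for this example.
mixed = (300, (1, 'name', 56), (1, 'other'))
plain = strip_line_numbers(mixed)
print(plain)
```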


.. function:: tuple2st(sequence)

   This is the same function as :func:`sequence2st`. This entry point is
   maintained for backward compatibility.


.. _converting-sts:

Converting ST Objects
---------------------

ST objects, regardless of the input used to create them, may be converted to
parse trees represented as list- or tuple- trees, or may be compiled into
executable code objects. Parse trees may be extracted with or without line
numbering information.


.. function:: st2list(st[, line_info])

   This function accepts an ST object from the caller in *st* and returns a
   Python list representing the equivalent parse tree. The resulting list
   representation can be used for inspection or the creation of a new parse tree in
   list form. This function does not fail so long as memory is available to build
   the list representation. If the parse tree will only be used for inspection,
   :func:`st2tuple` should be used instead to reduce memory consumption and
   fragmentation. When the list representation is required, this function is
   significantly faster than retrieving a tuple representation and converting that
   to nested lists.

   If *line_info* is true, line number information will be included for all
   terminal tokens as a third element of the list representing the token. Note
   that the line number provided specifies the line on which the token *ends*.
   This information is omitted if the flag is false or omitted.
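For comparison, converting a tuple tree to nested lists in pure Python looks
roughly like the sketch below. ``tuple_tree_to_lists`` is a hypothetical helper;
it illustrates the conversion that :func:`st2list` makes unnecessary and makes
no claim of its own about relative speed.

```python
def tuple_tree_to_lists(node):
    """Recursively convert a tuple parse tree into nested lists."""
    if isinstance(node, tuple):
        return [tuple_tree_to_lists(item) for item in node]
    # Integers (node types, line numbers) and strings are kept as-is.
    return node


# 300 is a made-up non-terminal number used only for this example.
tree = (300, (1, 'a', 1), (4, '', 1))
as_lists = tuple_tree_to_lists(tree)
print(as_lists)
```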


.. function:: st2tuple(st[, line_info])

   This function accepts an ST object from the caller in *st* and returns a
   Python tuple representing the equivalent parse tree. Other than returning a
   tuple instead of a list, this function is identical to :func:`st2list`.

   If *line_info* is true, line number information will be included for all
   terminal tokens as a third element of the tuple representing the token. This
   information is omitted if the flag is false or omitted.


.. function:: compilest(st[, filename='<syntax-tree>'])

   .. index::
      builtin: exec
      builtin: eval

   The Python byte compiler can be invoked on an ST object to produce code objects
   which can be used as part of a call to the built-in :func:`exec` or :func:`eval`
   functions. This function provides the interface to the compiler: it passes the
   internal parse tree from *st* to the byte compiler, using the source file name
   specified by the *filename* parameter. The default value supplied for *filename*
   indicates that the source was an ST object.

   Compiling an ST object may result in exceptions related to compilation; an
   example would be a :exc:`SyntaxError` caused by the parse tree for ``del f(0)``:
   this statement is considered legal within the formal grammar for Python but is
   not a legal language construct. The :exc:`SyntaxError` raised for this
   condition is normally generated by the Python byte-compiler itself, which is
   why it can be raised at this point by the :mod:`parser` module. Most causes of
   compilation failure can be diagnosed programmatically by inspection of the parse
   tree.
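The same class of error can be observed with the built-in :func:`compile`,
which does not depend on this module. The sketch below only assumes that
``del f(0)`` is rejected with :exc:`SyntaxError`; the stage that rejects it and
the message text vary between Python versions, so neither is checked.
``compiles_cleanly`` is a name invented here.

```python
def compiles_cleanly(source, mode="exec"):
    """Return True if *source* compiles, False on SyntaxError."""
    try:
        compile(source, "<example>", mode)
    except SyntaxError:
        return False
    return True


ok_delete = compiles_cleanly("del f[0]")    # deleting a subscript is legal
bad_delete = compiles_cleanly("del f(0)")   # deleting a call result is not
print(ok_delete, bad_delete)
```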


.. _querying-sts:

Queries on ST Objects
---------------------

Two functions are provided which allow an application to determine if an ST was
created as an expression or a suite. Neither of these functions can be used to
determine if an ST was created from source code via :func:`expr` or
:func:`suite` or from a parse tree via :func:`sequence2st`.


.. function:: isexpr(st)

   .. index:: builtin: compile

   When *st* represents an ``'eval'`` form, this function returns true, otherwise
   it returns false. This is useful, since code objects normally cannot be queried
   for this information using existing built-in functions. Note that the code
   objects created by :func:`compilest` cannot be queried like this either, and
   are identical to those created by the built-in :func:`compile` function.


.. function:: issuite(st)

   This function mirrors :func:`isexpr` in that it reports whether an ST object
   represents an ``'exec'`` form, commonly known as a "suite." It is not safe to
   assume that this function is equivalent to ``not isexpr(st)``, as additional
   syntactic fragments may be supported in the future.
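The two forms have a rough counterpart in the :mod:`ast` module, where the
parse mode determines the root node type. The sketch below is an analogy under
that assumption, not a description of how :func:`isexpr` or :func:`issuite` are
implemented; ``is_eval_form`` and ``is_exec_form`` are invented names.

```python
import ast


def is_eval_form(tree):
    """True for trees produced by ast.parse(..., mode='eval')."""
    return isinstance(tree, ast.Expression)


def is_exec_form(tree):
    """True for trees produced by ast.parse(..., mode='exec')."""
    return isinstance(tree, ast.Module)


expr_tree = ast.parse("a + 5", mode="eval")
suite_tree = ast.parse("a = 5\n", mode="exec")
# A third mode exists, so "not eval" does not imply "exec".
single_tree = ast.parse("a\n", mode="single")

print(is_eval_form(expr_tree), is_exec_form(suite_tree))
print(is_eval_form(single_tree), is_exec_form(single_tree))
```

The ``'single'`` tree answers false to both checks, which mirrors the warning
above against treating one query as the negation of the other.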


.. _st-errors:

Exceptions and Error Handling
-----------------------------

The parser module defines a single exception, but may also pass other built-in
exceptions from other portions of the Python runtime environment. See each
function for information about the exceptions it can raise.


.. exception:: ParserError

   Exception raised when a failure occurs within the parser module. This is
   generally produced for validation failures rather than the built-in
   :exc:`SyntaxError` raised during normal parsing. The exception argument is
   either a string describing the reason for the failure or a tuple containing a
   sequence causing the failure from a parse tree passed to :func:`sequence2st`
   and an explanatory string. Calls to :func:`sequence2st` need to be able to
   handle either form of the argument, while calls to other functions in the
   module will only need to be aware of the simple string values.

Note that the functions :func:`compilest`, :func:`expr`, and :func:`suite` may
raise exceptions which are normally raised by the parsing and compilation
process. These include the built-in exceptions :exc:`MemoryError`,
:exc:`OverflowError`, :exc:`SyntaxError`, and :exc:`SystemError`. In these
cases, these exceptions carry all the meaning normally associated with them.
Refer to the descriptions of each function for detailed information.


.. _st-objects:

ST Objects
----------

Ordered and equality comparisons are supported between ST objects. Pickling of
ST objects (using the :mod:`pickle` module) is also supported.


.. data:: STType

   The type of the objects returned by :func:`expr`, :func:`suite` and
   :func:`sequence2st`.

ST objects have the following methods:


.. method:: ST.compile([filename])

   Same as ``compilest(st, filename)``.


.. method:: ST.isexpr()

   Same as ``isexpr(st)``.


.. method:: ST.issuite()

   Same as ``issuite(st)``.


.. method:: ST.tolist([line_info])

   Same as ``st2list(st, line_info)``.


.. method:: ST.totuple([line_info])

   Same as ``st2tuple(st, line_info)``.


.. _st-examples:

Examples
--------

.. index:: builtin: compile

The parser module allows operations to be performed on the parse tree of Python
source code before the :term:`bytecode` is generated, and provides for inspection of the
parse tree for information gathering purposes. Two examples are presented. The
simple example demonstrates emulation of the :func:`compile` built-in function
and the complex example shows the use of a parse tree for information discovery.


Emulation of :func:`compile`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While many useful operations may take place between parsing and bytecode
generation, the simplest operation is to do nothing. For this purpose, using
the :mod:`parser` module to produce an intermediate data structure is equivalent
to the code ::

   >>> code = compile('a + 5', 'file.py', 'eval')
   >>> a = 5
   >>> eval(code)
   10

The equivalent operation using the :mod:`parser` module is somewhat longer, and
allows the intermediate internal parse tree to be retained as an ST object::

   >>> import parser
   >>> st = parser.expr('a + 5')
   >>> code = st.compile('file.py')
   >>> a = 5
   >>> eval(code)
   10

An application which needs both ST and code objects can package this code into
readily available functions::

   import parser

   def load_suite(source_string):
       st = parser.suite(source_string)
       return st, st.compile()

   def load_expression(source_string):
       st = parser.expr(source_string)
       return st, st.compile()


Information Discovery
^^^^^^^^^^^^^^^^^^^^^

.. index::
   single: string; documentation
   single: docstrings

Some applications benefit from direct access to the parse tree. The remainder
of this section demonstrates how the parse tree provides access to module
documentation defined in docstrings without requiring that the code being
examined be loaded into a running interpreter via :keyword:`import`. This can
be very useful for performing analyses of untrusted code.
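For orientation, the same goal (reading a module docstring without importing
the module) can be sketched with the :mod:`ast` module. This is an alternative
route shown only for comparison and is not the technique developed in the rest
of this section; the ``SOURCE`` string is a stand-in for the contents of a file.

```python
import ast

# Stand-in for the text of a module that consists only of a docstring.
SOURCE = '"""Some documentation.\n"""\n'

# ast.parse only parses; the module body is never executed.
module = ast.parse(SOURCE)
docstring = ast.get_docstring(module, clean=False)
print(repr(docstring))
```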

Generally, the example will demonstrate how the parse tree may be traversed to
distill interesting information. Two functions and a set of classes are
developed which provide programmatic access to high-level function and class
definitions provided by a module. The classes extract information from the
parse tree and provide access to the information at a useful semantic level, one
function provides a simple low-level pattern matching capability, and the other
function defines a high-level interface to the classes by handling file
operations on behalf of the caller. All source files mentioned here which are
not part of the Python installation are located in the :file:`Demo/parser/`
directory of the distribution.

The dynamic nature of Python allows the programmer a great deal of flexibility,
but most modules need only a limited measure of this when defining classes,
functions, and methods. In this example, the only definitions that will be
considered are those which are defined in the top level of their context, e.g.,
a function defined by a :keyword:`def` statement at column zero of a module, but
not a function defined within a branch of an :keyword:`if` ... :keyword:`else`
construct, though there are some good reasons for doing so in some situations.
Nesting of definitions will be handled by the code developed in the example.

To construct the upper-level extraction methods, we need to know what the parse
tree structure looks like and how much of it we actually need to be concerned
about. Python uses a moderately deep parse tree so there are a large number of
intermediate nodes. It is important to read and understand the formal grammar
used by Python. This is specified in the file :file:`Grammar/Grammar` in the
distribution. Consider the simplest case of interest when searching for
docstrings: a module consisting of a docstring and nothing else. (See file
:file:`docstring.py`.) ::

   """Some documentation.
   """

Using the interpreter to take a look at the parse tree, we find a bewildering
mass of numbers and parentheses, with the documentation buried deep in nested
tuples. ::

   >>> import parser
   >>> import pprint
   >>> st = parser.suite(open('docstring.py').read())
   >>> tup = st.totuple()
   >>> pprint.pprint(tup)
   (257,
    (264,
     (265,
      (266,
       (267,
        (307,
         (287,
          (288,
           (289,
            (290,
             (292,
              (293,
               (294,
                (295,
                 (296,
                  (297,
                   (298,
                    (299,
                     (300, (3, '"""Some documentation.\n"""'))))))))))))))))),
      (4, ''))),
    (4, ''),
    (0, ''))

The numbers at the first element of each node in the tree are the node types;
they map directly to terminal and non-terminal symbols in the grammar.
Unfortunately, they are represented as integers in the internal representation,
and the Python structures generated do not change that. However, the
:mod:`symbol` and :mod:`token` modules provide symbolic names for the node types
and dictionaries which map from the integers to the symbolic names for the node
types.
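A small sketch of that lookup for terminal node types, using only the
:mod:`token` module so that it does not depend on a particular grammar version.
``node_type_name`` is a hypothetical helper; non-terminal names would come from
the corresponding dictionary in :mod:`symbol` and are left as numbers here.

```python
import token


def node_type_name(node_type):
    """Return a readable label for a parse-tree node type."""
    if token.ISTERMINAL(node_type):
        # Terminal types are listed in token.tok_name.
        return token.tok_name[node_type]
    # Non-terminal names are not resolved in this sketch.
    return "nonterminal %d" % node_type


labels = [node_type_name(n) for n in (0, 3, 4, 257)]
print(labels)
```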

In the output presented above, the outermost tuple contains four elements: the
integer ``257`` and three additional tuples. Node type ``257`` has the symbolic
name :const:`file_input`. Each of these inner tuples contains an integer as the
first element; these integers, ``264``, ``4``, and ``0``, represent the node
types :const:`stmt`, :const:`NEWLINE`, and :const:`ENDMARKER`, respectively.
Note that these values may change depending on the version of Python you are
using; consult :file:`symbol.py` and :file:`token.py` for details of the
mapping. It should be fairly clear that the outermost node is related primarily
to the input source rather than the contents of the file, and may be disregarded
for the moment. The :const:`stmt` node is much more interesting. In
particular, all docstrings are found in subtrees which are formed exactly as
this node is formed, with the only difference being the string itself. The
association between the docstring in a similar tree and the defined entity
(class, function, or module) which it describes is given by the position of the
docstring subtree within the tree defining the described structure.

By replacing the actual docstring with something to signify a variable component
of the tree, we allow a simple pattern matching approach to check any given
subtree for equivalence to the general pattern for docstrings. Since the
example demonstrates information extraction, we can safely require that the tree
be in tuple form rather than list form, allowing a simple variable
representation to be ``['variable_name']``. A simple recursive function can
implement the pattern matching, returning a Boolean and a dictionary of variable
name to value mappings. (See file :file:`example.py`.) ::

   def match(pattern, data, vars=None):
       if vars is None:
           vars = {}
       # A list such as ['name'] is a variable: bind it to the data.
       if isinstance(pattern, list):
           vars[pattern[0]] = data
           return True, vars
       # Anything that is not a tuple is a literal and must compare equal.
       if not isinstance(pattern, tuple):
           return (pattern == data), vars
       if len(data) != len(pattern):
           return False, vars
       # Match the children pairwise, stopping at the first mismatch.
       for pattern, data in zip(pattern, data):
           same, vars = match(pattern, data, vars)
           if not same:
               break
       return same, vars
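A short usage sketch of the matcher on a hand-written tree follows. The
function body is repeated only so that the snippet runs on its own; the node
number ``300`` and the variable names ``first`` and ``second`` are invented for
the example.

```python
def match(pattern, data, vars=None):
    """Same algorithm as above, repeated so the example runs standalone."""
    if vars is None:
        vars = {}
    if isinstance(pattern, list):
        vars[pattern[0]] = data
        return True, vars
    if not isinstance(pattern, tuple):
        return (pattern == data), vars
    if len(data) != len(pattern):
        return False, vars
    for pattern, data in zip(pattern, data):
        same, vars = match(pattern, data, vars)
        if not same:
            break
    return same, vars


# 300 is a made-up non-terminal number; 1 is the NAME token type.
PATTERN = (300, (1, ['first']), (1, ['second']))

found, bound = match(PATTERN, (300, (1, 'spam'), (1, 'eggs')))
missed, _ = match(PATTERN, (300, (1, 'spam')))
print(found, bound, missed)
```

The second call fails because the data has fewer children than the pattern;
the matcher requires the same number of siblings at every level.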

Using this simple representation for syntactic variables and the symbolic node
types, the pattern for the candidate docstring subtrees becomes fairly readable.
(See file :file:`example.py`.) ::

   import symbol
   import token

   DOCSTRING_STMT_PATTERN = (
       symbol.stmt,
       (symbol.simple_stmt,
        (symbol.small_stmt,
         (symbol.expr_stmt,
          (symbol.testlist,
           (symbol.test,
            (symbol.and_test,
             (symbol.not_test,
              (symbol.comparison,
               (symbol.expr,
                (symbol.xor_expr,
                 (symbol.and_expr,
                  (symbol.shift_expr,
                   (symbol.arith_expr,
                    (symbol.term,
                     (symbol.factor,
                      (symbol.power,
                       (symbol.atom,
                        (token.STRING, ['docstring'])
                        )))))))))))))))),
        (token.NEWLINE, '')
        ))

Using the :func:`match` function with this pattern, extracting the module
docstring from the parse tree created previously is easy::

   >>> found, vars = match(DOCSTRING_STMT_PATTERN, tup[1])
   >>> found
   True
   >>> vars
   {'docstring': '"""Some documentation.\n"""'}

Once specific data can be extracted from a location where it is expected, the
question of where information can be expected needs to be answered. When
dealing with docstrings, the answer is fairly simple: the docstring is the first
:const:`stmt` node in a code block (:const:`file_input` or :const:`suite` node
types). A module consists of a single :const:`file_input` node, and class and
function definitions each contain exactly one :const:`suite` node. Classes and
functions are readily identified as subtrees of code block nodes which start
with ``(stmt, (compound_stmt, (classdef, ...`` or ``(stmt, (compound_stmt,
(funcdef, ...``. Note that these subtrees cannot be matched by :func:`match`
since it does not support multiple sibling nodes to match without regard to
number. A more elaborate matching function could be used to overcome this
limitation, but this is sufficient for the example.
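One possible shape for such a more elaborate matcher is sketched below. It is
not part of :file:`example.py`: the name ``match_prefix``, the use of ``...`` as
a trailing wildcard, and the node numbers in the sample data are all
assumptions made for illustration.

```python
def match_prefix(pattern, data, vars=None):
    """Like match(), but a tuple pattern ending in ... ignores extra siblings."""
    if vars is None:
        vars = {}
    if isinstance(pattern, list):
        # ['name'] binds whatever appears at this position.
        vars[pattern[0]] = data
        return True, vars
    if not isinstance(pattern, tuple):
        return pattern == data, vars
    if not isinstance(data, tuple):
        return False, vars
    if pattern and pattern[-1] is Ellipsis:
        # Compare only the leading elements; any number of siblings may follow.
        pattern = pattern[:-1]
        if len(data) < len(pattern):
            return False, vars
        data = data[:len(pattern)]
    elif len(data) != len(pattern):
        return False, vars
    for sub_pattern, sub_data in zip(pattern, data):
        same, vars = match_prefix(sub_pattern, sub_data, vars)
        if not same:
            return False, vars
    return True, vars


# Made-up node numbers: 264 = stmt, 290 = compound_stmt, 320 = classdef.
CLASSDEF_PREFIX = (264, (290, (320, (1, 'class'), (1, ['name']), ...)))

tree = (264, (290, (320, (1, 'class'), (1, 'Spam'), (11, ':'), (4, ''))))
found, bound = match_prefix(CLASSDEF_PREFIX, tree)
print(found, bound)
```

Only the leading siblings named in the pattern are compared, so the same
pattern would accept a class definition with or without a base-class list.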

Given the ability to determine whether a statement might be a docstring and
extract the actual string from the statement, some work needs to be performed to
walk the parse tree for an entire module and extract information about the names
defined in each context of the module and associate any docstrings with the
names. The code to perform this work is not complicated, but bears some
explanation.

The public interface to the classes is straightforward and should probably be
somewhat more flexible. Each "major" block of the module is described by an
object providing several methods for inquiry and a constructor which accepts at
least the subtree of the complete parse tree which it represents. The
:class:`ModuleInfo` constructor accepts an optional *name* parameter since it
cannot otherwise determine the name of the module.

The public classes include :class:`ClassInfo`, :class:`FunctionInfo`, and
:class:`ModuleInfo`. All objects provide the methods :meth:`get_name`,
:meth:`get_docstring`, :meth:`get_class_names`, and :meth:`get_class_info`. The
:class:`ClassInfo` objects support :meth:`get_method_names` and
:meth:`get_method_info` while the other classes provide
:meth:`get_function_names` and :meth:`get_function_info`.

Within each of the forms of code block that the public classes represent, most
of the required information is in the same form and is accessed in the same way,
with classes having the distinction that functions defined at the top level are
referred to as "methods." Since the difference in nomenclature reflects a real
semantic distinction from functions defined outside of a class, the
implementation needs to maintain the distinction. Hence, most of the
functionality of the public classes can be implemented in a common base class,
:class:`SuiteInfoBase`, with the accessors for function and method information
provided elsewhere. Note that there is only one class which represents function
and method information; this parallels the use of the :keyword:`def` statement
to define both types of elements.
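
The division of accessors described above might be laid out roughly as follows.
This is a minimal sketch and not the code from :file:`example.py`: the mix-in
name ``FunctionAccessors`` is hypothetical, the constructor takes a plain name
instead of a parse tree, and the information dictionaries are filled in by hand
so that the snippet runs on its own.

```python
class SuiteInfoBase:
    """Accessors shared by every kind of code block."""
    _docstring = ''
    _name = ''

    def __init__(self, name=''):
        # Assumption for this sketch: the real constructor receives a parse
        # tree and extracts this information from it.
        self._name = name
        self._class_info = {}
        self._function_info = {}

    def get_name(self):
        return self._name

    def get_docstring(self):
        return self._docstring

    def get_class_names(self):
        return list(self._class_info)

    def get_class_info(self, name):
        return self._class_info[name]


class FunctionAccessors:
    """Hypothetical mix-in: nested defs are reported as functions."""

    def get_function_names(self):
        return list(self._function_info)

    def get_function_info(self, name):
        return self._function_info[name]


class FunctionInfo(SuiteInfoBase, FunctionAccessors):
    """Represents both functions and methods, as ``def`` defines both."""


class ModuleInfo(SuiteInfoBase, FunctionAccessors):
    pass


class ClassInfo(SuiteInfoBase):
    """Nested defs are reported as methods, from the same dictionary."""

    def get_method_names(self):
        return list(self._function_info)

    def get_method_info(self, name):
        return self._function_info[name]


# Fill the dictionaries by hand to exercise the accessors.
spam = ClassInfo('Spam')
spam._function_info['eggs'] = FunctionInfo('eggs')
module = ModuleInfo('demo')
module._class_info['Spam'] = spam
module._function_info['main'] = FunctionInfo('main')

print(module.get_class_names())
print(module.get_class_info('Spam').get_method_names())
print(module.get_function_names())
```

Keeping a single ``_function_info`` dictionary in the base class and varying
only the accessor names is one way to preserve the function/method distinction
without duplicating the extraction logic.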

Most of the accessor functions are declared in :class:`SuiteInfoBase` and do not
need to be overridden by subclasses. More importantly, the extraction of most
information from a parse tree is handled through a method called by the
:class:`SuiteInfoBase` constructor. The example code for most of the classes is
clear when read alongside the formal grammar, but the method which recursively
creates new information objects requires further examination. Here is the
relevant part of the :class:`SuiteInfoBase` definition from :file:`example.py`::

   class SuiteInfoBase:
       _docstring = ''
       _name = ''

       def __init__(self, tree=None):
           self._class_info = {}
           self._function_info = {}
           if tree:
               self._extract_info(tree)

       def _extract_info(self, tree):
           # extract docstring
           if len(tree) == 2:
               # short form: match only the simple_stmt part of the pattern
               found, vars = match(DOCSTRING_STMT_PATTERN[1], tree[1])
           else:
               # long form: the first statement follows NEWLINE and INDENT
               found, vars = match(DOCSTRING_STMT_PATTERN, tree[3])
           if found:
               # the matched text is a string literal; evaluate it
               self._docstring = eval(vars['docstring'])
           # discover inner definitions
           for node in tree[1:]:
               found, vars = match(COMPOUND_STMT_PATTERN, node)
               if found:
                   cstmt = vars['compound']
                   if cstmt[0] == symbol.funcdef:
                       name = cstmt[2][1]
                       self._function_info[name] = FunctionInfo(cstmt)
                   elif cstmt[0] == symbol.classdef:
                       name = cstmt[2][1]
                       self._class_info[name] = ClassInfo(cstmt)

After initializing some internal state, the constructor calls the
:meth:`_extract_info` method. This method performs the bulk of the information
extraction which takes place in the entire example. The extraction has two
distinct phases: the location of the docstring for the parse tree passed in, and
the discovery of additional definitions within the code block represented by the
parse tree.

The initial :keyword:`if` test determines whether the nested suite is of the
"short form" or the "long form." The short form is used when the code block is
on the same line as the definition of the code block, as in ::

   def square(x): "Square an argument."; return x ** 2

while the long form uses an indented block and allows nested definitions::

   def make_power(exp):
       "Make a function that raises an argument to the exponent `exp`."
       def raiser(x, y=exp):
           return x ** y
       return raiser

When the short form is used, the code block may contain a docstring as the
first, and possibly only, :const:`small_stmt` element. The extraction of such a
docstring is slightly different and requires only a portion of the complete
pattern used in the more common case. As implemented, the docstring will only
be found if there is only one :const:`small_stmt` node in the
:const:`simple_stmt` node. Since most functions and methods which use the short
form do not provide a docstring, this may be considered sufficient. The
extraction of the docstring proceeds using the :func:`match` function as
described above, and the value of the docstring is stored as an attribute of the
:class:`SuiteInfoBase` object.
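
The limitation is easy to observe, because Python itself treats the leading
string of a short-form body as a docstring even when further statements follow
on the same line. The check below is an illustration only; it relies on the
standard :mod:`ast` module and on ``exec`` rather than on the pattern matching
used in this example.

```python
import ast

source = 'def square(x): "Square an argument."; return x ** 2\n'

# The short-form body holds two statements on the definition line.
func = ast.parse(source).body[0]
print(len(func.body))
print(ast.get_docstring(func))

# The interpreter records the same string on the function object.
namespace = {}
exec(source, namespace)
print(namespace['square'].__doc__)
```

Going by the restriction described above, the pattern-based extraction would
leave the docstring of ``square`` empty, since its :const:`simple_stmt` node
contains more than one :const:`small_stmt` node.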

After docstring extraction, a simple definition discovery algorithm operates on
the :const:`stmt` nodes of the :const:`suite` node. The special case of the
short form is not tested; since there are no :const:`stmt` nodes in the short
form, the algorithm will silently skip the single :const:`simple_stmt` node and
correctly not discover any nested definitions.
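
The behavior on the two suite shapes can be sketched with hand-built tuples.
Everything here is a simplification made for illustration: the node labels are
strings standing in for the numeric codes of the :mod:`symbol` and :mod:`token`
modules, the trees omit most of the real nesting, and ``match`` is restated
from the description earlier in this document so that the snippet runs on its
own. ``nested_definition_names`` is a hypothetical helper, not part of
:file:`example.py`.

```python
# Assumed stand-ins for numeric symbol and token codes.
STMT, COMPOUND_STMT = "stmt", "compound_stmt"
SIMPLE_STMT, SMALL_STMT = "simple_stmt", "small_stmt"
SUITE, FUNCDEF, CLASSDEF = "suite", "funcdef", "classdef"
NAME, NEWLINE, INDENT, DEDENT = "NAME", "NEWLINE", "INDENT", "DEDENT"


def match(pattern, data, vars=None):
    """Match nested tuples; a one-element list in the pattern captures data."""
    if vars is None:
        vars = {}
    if isinstance(pattern, list):
        vars[pattern[0]] = data
        return True, vars
    if not isinstance(pattern, tuple):
        return pattern == data, vars
    if not isinstance(data, tuple) or len(data) != len(pattern):
        return False, vars
    for sub_pattern, sub_data in zip(pattern, data):
        same, vars = match(sub_pattern, sub_data, vars)
        if not same:
            return False, vars
    return True, vars


COMPOUND_STMT_PATTERN = (STMT, (COMPOUND_STMT, ["compound"]))


def nested_definition_names(suite):
    """Collect names of defs and classes that are direct children of suite."""
    names = []
    for node in suite[1:]:
        found, vars = match(COMPOUND_STMT_PATTERN, node)
        if found and vars["compound"][0] in (FUNCDEF, CLASSDEF):
            names.append(vars["compound"][2][1])
    return names


# Parameters and body of the nested definition are omitted in this toy tree.
raiser_def = (FUNCDEF, (NAME, "def"), (NAME, "raiser"))

long_form = (SUITE,
             (NEWLINE, ""), (INDENT, ""),
             (STMT, (SIMPLE_STMT, (SMALL_STMT, "docstring"), (NEWLINE, ""))),
             (STMT, (COMPOUND_STMT, raiser_def)),
             (STMT, (SIMPLE_STMT, (SMALL_STMT, "return"), (NEWLINE, ""))),
             (DEDENT, ""))

short_form = (SUITE,
              (SIMPLE_STMT, (SMALL_STMT, "docstring"), (NEWLINE, "")))

print(nested_definition_names(long_form))
print(nested_definition_names(short_form))
```

Because the pattern begins with the :const:`stmt` label, the lone
:const:`simple_stmt` child of the short form fails to match and the loop
finishes without recording anything.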

Each statement in the code block is categorized as a class definition, function
or method definition, or something else. For the definition statements, the
name of the element defined is extracted and a representation object appropriate
to the definition is created with the defining subtree passed as an argument to
the constructor. The representation objects are stored in instance variables
and may be retrieved by name using the appropriate accessor methods.

The public classes provide any accessors required which are more specific than
those provided by the :class:`SuiteInfoBase` class, but the real extraction
algorithm remains common to all forms of code blocks. A high-level function can
be used to extract the complete set of information from a source file. (See
file :file:`example.py`.) ::

   def get_docs(fileName):
       import os
       import parser

       # read the source and close the file promptly
       with open(fileName) as f:
           source = f.read()
       basename = os.path.basename(os.path.splitext(fileName)[0])
       st = parser.suite(source)
       return ModuleInfo(st.totuple(), basename)

This provides an easy-to-use interface to the documentation of a module. If
information is required which is not extracted by the code of this example, the
code may be extended at clearly defined points to provide additional
capabilities.
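
For comparison, the same two-phase walk (record the docstring, then recurse
into nested definitions) can be sketched with the :mod:`ast` module mentioned
in the note at the top of this document. This is a different technique from the
tuple matching shown here, offered only as an illustration; ``collect_docs``
and the shape of the dictionary it returns are hypothetical.

```python
import ast


def collect_docs(node):
    """Return the docstring and nested definitions of a module, class or def."""
    info = {"docstring": ast.get_docstring(node) or "",
            "classes": {}, "functions": {}}
    # Only direct children are examined; deeper levels are handled by recursion.
    for child in node.body:
        if isinstance(child, ast.ClassDef):
            info["classes"][child.name] = collect_docs(child)
        elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
            info["functions"][child.name] = collect_docs(child)
    return info


source = '''
"""Demo module."""

class Spam:
    "A class."
    def eggs(self):
        "A method."

def make_power(exp):
    "Make a function that raises an argument to the exponent `exp`."
    def raiser(x, y=exp):
        return x ** y
    return raiser
'''

docs = collect_docs(ast.parse(source))
print(docs["docstring"])
print(sorted(docs["classes"]), sorted(docs["functions"]))
print(sorted(docs["functions"]["make_power"]["functions"]))
```

Unlike the classes in this example, the sketch stores methods under the same
``"functions"`` key as ordinary functions; a fuller version would need to keep
the two apart.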