Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 1 | |
| 2 | :mod:`parser` --- Access Python parse trees |
| 3 | =========================================== |
| 4 | |
| 5 | .. module:: parser |
| 6 | :synopsis: Access parse trees for Python source code. |
| 7 | .. moduleauthor:: Fred L. Drake, Jr. <fdrake@acm.org> |
| 8 | .. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org> |
| 9 | |
| 10 | |
Christian Heimes | 5b5e81c | 2007-12-31 16:14:33 +0000 | [diff] [blame] | 11 | .. Copyright 1995 Virginia Polytechnic Institute and State University and Fred |
| 12 | L. Drake, Jr. This copyright notice must be distributed on all copies, but |
| 13 | this document otherwise may be distributed as part of the Python |
| 14 | distribution. No fee may be charged for this document in any representation, |
| 15 | either on paper or electronically. This restriction does not affect other |
| 16 | elements in a distributed package in any way. |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 17 | |
| 18 | .. index:: single: parsing; Python source code |
| 19 | |
| 20 | The :mod:`parser` module provides an interface to Python's internal parser and |
| 21 | byte-code compiler. The primary purpose for this interface is to allow Python |
| 22 | code to edit the parse tree of a Python expression and create executable code |
| 23 | from this. This is better than trying to parse and modify an arbitrary Python |
| 24 | code fragment as a string because parsing is performed in a manner identical to |
| 25 | the code forming the application. It is also faster. |
| 26 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 27 | .. note:: |
| 28 | |
| 29 | From Python 2.5 onward, it's much more convenient to cut in at the Abstract |
| 30 | Syntax Tree (AST) generation and compilation stage, using the :mod:`ast` |
| 31 | module. |
| 32 | |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 33 | There are a few things to note about this module which are important to making |
| 34 | use of the data structures created. This is not a tutorial on editing the parse |
| 35 | trees for Python code, but some examples of using the :mod:`parser` module are |
| 36 | presented. |
| 37 | |
| 38 | Most importantly, a good understanding of the Python grammar processed by the |
| 39 | internal parser is required. For full information on the language syntax, refer |
| 40 | to :ref:`reference-index`. The parser |
| 41 | itself is created from a grammar specification defined in the file |
| 42 | :file:`Grammar/Grammar` in the standard Python distribution. The parse trees |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 43 | stored in the ST objects created by this module are the actual output from the |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 44 | internal parser when created by the :func:`expr` or :func:`suite` functions, |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 45 | described below. The ST objects created by :func:`sequence2st` faithfully |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 46 | simulate those structures. Be aware that the values of the sequences which are |
| 47 | considered "correct" will vary from one version of Python to another as the |
| 48 | formal grammar for the language is revised. However, transporting code from one |
| 49 | Python version to another as source text will always allow correct parse trees |
| 50 | to be created in the target version, with the only restriction being that |
| 51 | migrating to an older version of the interpreter will not support more recent |
| 52 | language constructs. The parse trees are not typically compatible from one |
| 53 | version to another, whereas source code has always been forward-compatible. |
| 54 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 55 | Each element of the sequences returned by :func:`st2list` or :func:`st2tuple` |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 56 | has a simple form. Sequences representing non-terminal elements in the grammar |
| 57 | always have a length greater than one. The first element is an integer which |
| 58 | identifies a production in the grammar. These integers are given symbolic names |
| 59 | in the C header file :file:`Include/graminit.h` and the Python module |
| 60 | :mod:`symbol`. Each additional element of the sequence represents a component |
| 61 | of the production as recognized in the input string: these are always sequences |
| 62 | which have the same form as the parent. An important aspect of this structure |
| 63 | which should be noted is that keywords used to identify the parent node type, |
| 64 | such as the keyword :keyword:`if` in an :const:`if_stmt`, are included in the |
| 65 | node tree without any special treatment. For example, the :keyword:`if` keyword |
| 66 | is represented by the tuple ``(1, 'if')``, where ``1`` is the numeric value |
| 67 | associated with all :const:`NAME` tokens, including variable and function names |
| 68 | defined by the user. In an alternate form returned when line number information |
| 69 | is requested, the same token might be represented as ``(1, 'if', 12)``, where |
| 70 | the ``12`` represents the line number at which the terminal symbol was found. |
| 71 | |
| 72 | Terminal elements are represented in much the same way, but without any child |
| 73 | elements and the addition of the source text which was identified. The example |
| 74 | of the :keyword:`if` keyword above is representative. The various types of |
| 75 | terminal symbols are defined in the C header file :file:`Include/token.h` and |
| 76 | the Python module :mod:`token`. |
| 77 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 78 | The ST objects are not required to support the functionality of this module, |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 79 | but are provided for three purposes: to allow an application to amortize the |
| 80 | cost of processing complex parse trees, to provide a parse tree representation |
| 81 | which conserves memory space when compared to the Python list or tuple |
| 82 | representation, and to ease the creation of additional modules in C which |
| 83 | manipulate parse trees. A simple "wrapper" class may be created in Python to |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 84 | hide the use of ST objects. |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 85 | |
| 86 | The :mod:`parser` module defines functions for a few distinct purposes. The |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 87 | most important purposes are to create ST objects and to convert ST objects to |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 88 | other representations such as parse trees and compiled code objects, but there |
| 89 | are also functions which serve to query the type of parse tree represented by an |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 90 | ST object. |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 91 | |
| 92 | |
| 93 | .. seealso:: |
| 94 | |
| 95 | Module :mod:`symbol` |
| 96 | Useful constants representing internal nodes of the parse tree. |
| 97 | |
| 98 | Module :mod:`token` |
| 99 | Useful constants representing leaf nodes of the parse tree and functions for |
| 100 | testing node values. |
| 101 | |
| 102 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 103 | .. _creating-sts: |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 104 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 105 | Creating ST Objects |
| 106 | ------------------- |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 107 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 108 | ST objects may be created from source code or from a parse tree. When creating |
| 109 | an ST object from source, different functions are used to create the ``'eval'`` |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 110 | and ``'exec'`` forms. |
| 111 | |
| 112 | |
| 113 | .. function:: expr(source) |
| 114 | |
| 115 | The :func:`expr` function parses the parameter *source* as if it were an input |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 116 | to ``compile(source, 'file.py', 'eval')``. If the parse succeeds, an ST object |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 117 | is created to hold the internal parse tree representation, otherwise an |
| 118 | appropriate exception is thrown. |
| 119 | |
| 120 | |
| 121 | .. function:: suite(source) |
| 122 | |
| 123 | The :func:`suite` function parses the parameter *source* as if it were an input |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 124 | to ``compile(source, 'file.py', 'exec')``. If the parse succeeds, an ST object |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 125 | is created to hold the internal parse tree representation, otherwise an |
| 126 | appropriate exception is thrown. |
| 127 | |
| 128 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 129 | .. function:: sequence2st(sequence) |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 130 | |
| 131 | This function accepts a parse tree represented as a sequence and builds an |
| 132 | internal representation if possible. If it can validate that the tree conforms |
| 133 | to the Python grammar and all nodes are valid node types in the host version of |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 134 | Python, an ST object is created from the internal representation and returned |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 135 | to the called. If there is a problem creating the internal representation, or |
| 136 | if the tree cannot be validated, a :exc:`ParserError` exception is thrown. An |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 137 | ST object created this way should not be assumed to compile correctly; normal |
| 138 | exceptions thrown by compilation may still be initiated when the ST object is |
| 139 | passed to :func:`compilest`. This may indicate problems not related to syntax |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 140 | (such as a :exc:`MemoryError` exception), but may also be due to constructs such |
| 141 | as the result of parsing ``del f(0)``, which escapes the Python parser but is |
| 142 | checked by the bytecode compiler. |
| 143 | |
| 144 | Sequences representing terminal tokens may be represented as either two-element |
| 145 | lists of the form ``(1, 'name')`` or as three-element lists of the form ``(1, |
| 146 | 'name', 56)``. If the third element is present, it is assumed to be a valid |
| 147 | line number. The line number may be specified for any subset of the terminal |
| 148 | symbols in the input tree. |
| 149 | |
| 150 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 151 | .. function:: tuple2st(sequence) |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 152 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 153 | This is the same function as :func:`sequence2st`. This entry point is |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 154 | maintained for backward compatibility. |
| 155 | |
| 156 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 157 | .. _converting-sts: |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 158 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 159 | Converting ST Objects |
| 160 | --------------------- |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 161 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 162 | ST objects, regardless of the input used to create them, may be converted to |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 163 | parse trees represented as list- or tuple- trees, or may be compiled into |
| 164 | executable code objects. Parse trees may be extracted with or without line |
| 165 | numbering information. |
| 166 | |
| 167 | |
Georg Brandl | 30704ea0 | 2008-07-23 15:07:12 +0000 | [diff] [blame] | 168 | .. function:: st2list(st[, line_info]) |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 169 | |
Georg Brandl | 30704ea0 | 2008-07-23 15:07:12 +0000 | [diff] [blame] | 170 | This function accepts an ST object from the caller in *st* and returns a |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 171 | Python list representing the equivalent parse tree. The resulting list |
| 172 | representation can be used for inspection or the creation of a new parse tree in |
| 173 | list form. This function does not fail so long as memory is available to build |
| 174 | the list representation. If the parse tree will only be used for inspection, |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 175 | :func:`st2tuple` should be used instead to reduce memory consumption and |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 176 | fragmentation. When the list representation is required, this function is |
| 177 | significantly faster than retrieving a tuple representation and converting that |
| 178 | to nested lists. |
| 179 | |
| 180 | If *line_info* is true, line number information will be included for all |
| 181 | terminal tokens as a third element of the list representing the token. Note |
| 182 | that the line number provided specifies the line on which the token *ends*. |
| 183 | This information is omitted if the flag is false or omitted. |
| 184 | |
| 185 | |
Georg Brandl | 30704ea0 | 2008-07-23 15:07:12 +0000 | [diff] [blame] | 186 | .. function:: st2tuple(st[, line_info]) |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 187 | |
Georg Brandl | 30704ea0 | 2008-07-23 15:07:12 +0000 | [diff] [blame] | 188 | This function accepts an ST object from the caller in *st* and returns a |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 189 | Python tuple representing the equivalent parse tree. Other than returning a |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 190 | tuple instead of a list, this function is identical to :func:`st2list`. |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 191 | |
| 192 | If *line_info* is true, line number information will be included for all |
| 193 | terminal tokens as a third element of the list representing the token. This |
| 194 | information is omitted if the flag is false or omitted. |
| 195 | |
| 196 | |
Georg Brandl | 30704ea0 | 2008-07-23 15:07:12 +0000 | [diff] [blame] | 197 | .. function:: compilest(st[, filename='<syntax-tree>']) |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 198 | |
| 199 | .. index:: |
| 200 | builtin: exec |
| 201 | builtin: eval |
| 202 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 203 | The Python byte compiler can be invoked on an ST object to produce code objects |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 204 | which can be used as part of a call to the built-in :func:`exec` or :func:`eval` |
| 205 | functions. This function provides the interface to the compiler, passing the |
Georg Brandl | 30704ea0 | 2008-07-23 15:07:12 +0000 | [diff] [blame] | 206 | internal parse tree from *st* to the parser, using the source file name |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 207 | specified by the *filename* parameter. The default value supplied for *filename* |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 208 | indicates that the source was an ST object. |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 209 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 210 | Compiling an ST object may result in exceptions related to compilation; an |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 211 | example would be a :exc:`SyntaxError` caused by the parse tree for ``del f(0)``: |
| 212 | this statement is considered legal within the formal grammar for Python but is |
| 213 | not a legal language construct. The :exc:`SyntaxError` raised for this |
| 214 | condition is actually generated by the Python byte-compiler normally, which is |
| 215 | why it can be raised at this point by the :mod:`parser` module. Most causes of |
| 216 | compilation failure can be diagnosed programmatically by inspection of the parse |
| 217 | tree. |
| 218 | |
| 219 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 220 | .. _querying-sts: |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 221 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 222 | Queries on ST Objects |
| 223 | --------------------- |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 224 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 225 | Two functions are provided which allow an application to determine if an ST was |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 226 | created as an expression or a suite. Neither of these functions can be used to |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 227 | determine if an ST was created from source code via :func:`expr` or |
| 228 | :func:`suite` or from a parse tree via :func:`sequence2st`. |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 229 | |
| 230 | |
Georg Brandl | 30704ea0 | 2008-07-23 15:07:12 +0000 | [diff] [blame] | 231 | .. function:: isexpr(st) |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 232 | |
| 233 | .. index:: builtin: compile |
| 234 | |
Georg Brandl | 30704ea0 | 2008-07-23 15:07:12 +0000 | [diff] [blame] | 235 | When *st* represents an ``'eval'`` form, this function returns true, otherwise |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 236 | it returns false. This is useful, since code objects normally cannot be queried |
| 237 | for this information using existing built-in functions. Note that the code |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 238 | objects created by :func:`compilest` cannot be queried like this either, and |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 239 | are identical to those created by the built-in :func:`compile` function. |
| 240 | |
| 241 | |
Georg Brandl | 30704ea0 | 2008-07-23 15:07:12 +0000 | [diff] [blame] | 242 | .. function:: issuite(st) |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 243 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 244 | This function mirrors :func:`isexpr` in that it reports whether an ST object |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 245 | represents an ``'exec'`` form, commonly known as a "suite." It is not safe to |
Georg Brandl | 30704ea0 | 2008-07-23 15:07:12 +0000 | [diff] [blame] | 246 | assume that this function is equivalent to ``not isexpr(st)``, as additional |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 247 | syntactic fragments may be supported in the future. |
| 248 | |
| 249 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 250 | .. _st-errors: |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 251 | |
| 252 | Exceptions and Error Handling |
| 253 | ----------------------------- |
| 254 | |
| 255 | The parser module defines a single exception, but may also pass other built-in |
| 256 | exceptions from other portions of the Python runtime environment. See each |
| 257 | function for information about the exceptions it can raise. |
| 258 | |
| 259 | |
| 260 | .. exception:: ParserError |
| 261 | |
| 262 | Exception raised when a failure occurs within the parser module. This is |
| 263 | generally produced for validation failures rather than the built in |
| 264 | :exc:`SyntaxError` thrown during normal parsing. The exception argument is |
| 265 | either a string describing the reason of the failure or a tuple containing a |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 266 | sequence causing the failure from a parse tree passed to :func:`sequence2st` |
| 267 | and an explanatory string. Calls to :func:`sequence2st` need to be able to |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 268 | handle either type of exception, while calls to other functions in the module |
| 269 | will only need to be aware of the simple string values. |
| 270 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 271 | Note that the functions :func:`compilest`, :func:`expr`, and :func:`suite` may |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 272 | throw exceptions which are normally thrown by the parsing and compilation |
| 273 | process. These include the built in exceptions :exc:`MemoryError`, |
| 274 | :exc:`OverflowError`, :exc:`SyntaxError`, and :exc:`SystemError`. In these |
| 275 | cases, these exceptions carry all the meaning normally associated with them. |
| 276 | Refer to the descriptions of each function for detailed information. |
| 277 | |
| 278 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 279 | .. _st-objects: |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 280 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 281 | ST Objects |
| 282 | ---------- |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 283 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 284 | Ordered and equality comparisons are supported between ST objects. Pickling of |
| 285 | ST objects (using the :mod:`pickle` module) is also supported. |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 286 | |
| 287 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 288 | .. data:: STType |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 289 | |
| 290 | The type of the objects returned by :func:`expr`, :func:`suite` and |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 291 | :func:`sequence2st`. |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 292 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 293 | ST objects have the following methods: |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 294 | |
| 295 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 296 | .. method:: ST.compile([filename]) |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 297 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 298 | Same as ``compilest(st, filename)``. |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 299 | |
| 300 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 301 | .. method:: ST.isexpr() |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 302 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 303 | Same as ``isexpr(st)``. |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 304 | |
| 305 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 306 | .. method:: ST.issuite() |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 307 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 308 | Same as ``issuite(st)``. |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 309 | |
| 310 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 311 | .. method:: ST.tolist([line_info]) |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 312 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 313 | Same as ``st2list(st, line_info)``. |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 314 | |
| 315 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 316 | .. method:: ST.totuple([line_info]) |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 317 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 318 | Same as ``st2tuple(st, line_info)``. |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 319 | |
| 320 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 321 | .. _st-examples: |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 322 | |
| 323 | Examples |
| 324 | -------- |
| 325 | |
| 326 | .. index:: builtin: compile |
| 327 | |
| 328 | The parser modules allows operations to be performed on the parse tree of Python |
Georg Brandl | 9afde1c | 2007-11-01 20:32:30 +0000 | [diff] [blame] | 329 | source code before the :term:`bytecode` is generated, and provides for inspection of the |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 330 | parse tree for information gathering purposes. Two examples are presented. The |
| 331 | simple example demonstrates emulation of the :func:`compile` built-in function |
| 332 | and the complex example shows the use of a parse tree for information discovery. |
| 333 | |
| 334 | |
| 335 | Emulation of :func:`compile` |
| 336 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 337 | |
| 338 | While many useful operations may take place between parsing and bytecode |
| 339 | generation, the simplest operation is to do nothing. For this purpose, using |
| 340 | the :mod:`parser` module to produce an intermediate data structure is equivalent |
| 341 | to the code :: |
| 342 | |
| 343 | >>> code = compile('a + 5', 'file.py', 'eval') |
| 344 | >>> a = 5 |
| 345 | >>> eval(code) |
| 346 | 10 |
| 347 | |
| 348 | The equivalent operation using the :mod:`parser` module is somewhat longer, and |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 349 | allows the intermediate internal parse tree to be retained as an ST object:: |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 350 | |
| 351 | >>> import parser |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 352 | >>> st = parser.expr('a + 5') |
| 353 | >>> code = st.compile('file.py') |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 354 | >>> a = 5 |
| 355 | >>> eval(code) |
| 356 | 10 |
| 357 | |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 358 | An application which needs both ST and code objects can package this code into |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 359 | readily available functions:: |
| 360 | |
| 361 | import parser |
| 362 | |
| 363 | def load_suite(source_string): |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 364 | st = parser.suite(source_string) |
| 365 | return st, st.compile() |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 366 | |
| 367 | def load_expression(source_string): |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 368 | st = parser.expr(source_string) |
| 369 | return st, st.compile() |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 370 | |
| 371 | |
| 372 | Information Discovery |
| 373 | ^^^^^^^^^^^^^^^^^^^^^ |
| 374 | |
| 375 | .. index:: |
| 376 | single: string; documentation |
| 377 | single: docstrings |
| 378 | |
| 379 | Some applications benefit from direct access to the parse tree. The remainder |
| 380 | of this section demonstrates how the parse tree provides access to module |
| 381 | documentation defined in docstrings without requiring that the code being |
| 382 | examined be loaded into a running interpreter via :keyword:`import`. This can |
| 383 | be very useful for performing analyses of untrusted code. |
| 384 | |
| 385 | Generally, the example will demonstrate how the parse tree may be traversed to |
| 386 | distill interesting information. Two functions and a set of classes are |
| 387 | developed which provide programmatic access to high level function and class |
| 388 | definitions provided by a module. The classes extract information from the |
| 389 | parse tree and provide access to the information at a useful semantic level, one |
| 390 | function provides a simple low-level pattern matching capability, and the other |
| 391 | function defines a high-level interface to the classes by handling file |
| 392 | operations on behalf of the caller. All source files mentioned here which are |
| 393 | not part of the Python installation are located in the :file:`Demo/parser/` |
| 394 | directory of the distribution. |
| 395 | |
| 396 | The dynamic nature of Python allows the programmer a great deal of flexibility, |
| 397 | but most modules need only a limited measure of this when defining classes, |
| 398 | functions, and methods. In this example, the only definitions that will be |
| 399 | considered are those which are defined in the top level of their context, e.g., |
| 400 | a function defined by a :keyword:`def` statement at column zero of a module, but |
| 401 | not a function defined within a branch of an :keyword:`if` ... :keyword:`else` |
| 402 | construct, though there are some good reasons for doing so in some situations. |
| 403 | Nesting of definitions will be handled by the code developed in the example. |
| 404 | |
| 405 | To construct the upper-level extraction methods, we need to know what the parse |
| 406 | tree structure looks like and how much of it we actually need to be concerned |
| 407 | about. Python uses a moderately deep parse tree so there are a large number of |
| 408 | intermediate nodes. It is important to read and understand the formal grammar |
| 409 | used by Python. This is specified in the file :file:`Grammar/Grammar` in the |
| 410 | distribution. Consider the simplest case of interest when searching for |
| 411 | docstrings: a module consisting of a docstring and nothing else. (See file |
| 412 | :file:`docstring.py`.) :: |
| 413 | |
| 414 | """Some documentation. |
| 415 | """ |
| 416 | |
| 417 | Using the interpreter to take a look at the parse tree, we find a bewildering |
| 418 | mass of numbers and parentheses, with the documentation buried deep in nested |
| 419 | tuples. :: |
| 420 | |
| 421 | >>> import parser |
| 422 | >>> import pprint |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 423 | >>> st = parser.suite(open('docstring.py').read()) |
| 424 | >>> tup = st.totuple() |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 425 | >>> pprint.pprint(tup) |
| 426 | (257, |
| 427 | (264, |
| 428 | (265, |
| 429 | (266, |
| 430 | (267, |
| 431 | (307, |
| 432 | (287, |
| 433 | (288, |
| 434 | (289, |
| 435 | (290, |
| 436 | (292, |
| 437 | (293, |
| 438 | (294, |
| 439 | (295, |
| 440 | (296, |
| 441 | (297, |
| 442 | (298, |
| 443 | (299, |
| 444 | (300, (3, '"""Some documentation.\n"""'))))))))))))))))), |
| 445 | (4, ''))), |
| 446 | (4, ''), |
| 447 | (0, '')) |
| 448 | |
| 449 | The numbers at the first element of each node in the tree are the node types; |
| 450 | they map directly to terminal and non-terminal symbols in the grammar. |
| 451 | Unfortunately, they are represented as integers in the internal representation, |
| 452 | and the Python structures generated do not change that. However, the |
| 453 | :mod:`symbol` and :mod:`token` modules provide symbolic names for the node types |
| 454 | and dictionaries which map from the integers to the symbolic names for the node |
| 455 | types. |
| 456 | |
| 457 | In the output presented above, the outermost tuple contains four elements: the |
| 458 | integer ``257`` and three additional tuples. Node type ``257`` has the symbolic |
| 459 | name :const:`file_input`. Each of these inner tuples contains an integer as the |
| 460 | first element; these integers, ``264``, ``4``, and ``0``, represent the node |
| 461 | types :const:`stmt`, :const:`NEWLINE`, and :const:`ENDMARKER`, respectively. |
| 462 | Note that these values may change depending on the version of Python you are |
| 463 | using; consult :file:`symbol.py` and :file:`token.py` for details of the |
| 464 | mapping. It should be fairly clear that the outermost node is related primarily |
| 465 | to the input source rather than the contents of the file, and may be disregarded |
| 466 | for the moment. The :const:`stmt` node is much more interesting. In |
| 467 | particular, all docstrings are found in subtrees which are formed exactly as |
| 468 | this node is formed, with the only difference being the string itself. The |
| 469 | association between the docstring in a similar tree and the defined entity |
| 470 | (class, function, or module) which it describes is given by the position of the |
| 471 | docstring subtree within the tree defining the described structure. |
| 472 | |
| 473 | By replacing the actual docstring with something to signify a variable component |
| 474 | of the tree, we allow a simple pattern matching approach to check any given |
| 475 | subtree for equivalence to the general pattern for docstrings. Since the |
| 476 | example demonstrates information extraction, we can safely require that the tree |
| 477 | be in tuple form rather than list form, allowing a simple variable |
| 478 | representation to be ``['variable_name']``. A simple recursive function can |
| 479 | implement the pattern matching, returning a Boolean and a dictionary of variable |
| 480 | name to value mappings. (See file :file:`example.py`.) :: |
| 481 | |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 482 | def match(pattern, data, vars=None): |
| 483 | if vars is None: |
| 484 | vars = {} |
Collin Winter | 1b1498b | 2007-08-28 06:10:19 +0000 | [diff] [blame] | 485 | if isinstance(pattern, list): |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 486 | vars[pattern[0]] = data |
Collin Winter | 1b1498b | 2007-08-28 06:10:19 +0000 | [diff] [blame] | 487 | return True, vars |
| 488 | if not instance(pattern, tuple): |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 489 | return (pattern == data), vars |
| 490 | if len(data) != len(pattern): |
Collin Winter | 1b1498b | 2007-08-28 06:10:19 +0000 | [diff] [blame] | 491 | return False, vars |
| 492 | for pattern, data in zip(pattern, data): |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 493 | same, vars = match(pattern, data, vars) |
| 494 | if not same: |
| 495 | break |
| 496 | return same, vars |
| 497 | |
| 498 | Using this simple representation for syntactic variables and the symbolic node |
| 499 | types, the pattern for the candidate docstring subtrees becomes fairly readable. |
| 500 | (See file :file:`example.py`.) :: |
| 501 | |
| 502 | import symbol |
| 503 | import token |
| 504 | |
| 505 | DOCSTRING_STMT_PATTERN = ( |
| 506 | symbol.stmt, |
| 507 | (symbol.simple_stmt, |
| 508 | (symbol.small_stmt, |
| 509 | (symbol.expr_stmt, |
| 510 | (symbol.testlist, |
| 511 | (symbol.test, |
| 512 | (symbol.and_test, |
| 513 | (symbol.not_test, |
| 514 | (symbol.comparison, |
| 515 | (symbol.expr, |
| 516 | (symbol.xor_expr, |
| 517 | (symbol.and_expr, |
| 518 | (symbol.shift_expr, |
| 519 | (symbol.arith_expr, |
| 520 | (symbol.term, |
| 521 | (symbol.factor, |
| 522 | (symbol.power, |
| 523 | (symbol.atom, |
| 524 | (token.STRING, ['docstring']) |
| 525 | )))))))))))))))), |
| 526 | (token.NEWLINE, '') |
| 527 | )) |
| 528 | |
| 529 | Using the :func:`match` function with this pattern, extracting the module |
| 530 | docstring from the parse tree created previously is easy:: |
| 531 | |
| 532 | >>> found, vars = match(DOCSTRING_STMT_PATTERN, tup[1]) |
| 533 | >>> found |
Collin Winter | 1b1498b | 2007-08-28 06:10:19 +0000 | [diff] [blame] | 534 | True |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 535 | >>> vars |
| 536 | {'docstring': '"""Some documentation.\n"""'} |
| 537 | |
| 538 | Once specific data can be extracted from a location where it is expected, the |
| 539 | question of where information can be expected needs to be answered. When |
| 540 | dealing with docstrings, the answer is fairly simple: the docstring is the first |
| 541 | :const:`stmt` node in a code block (:const:`file_input` or :const:`suite` node |
| 542 | types). A module consists of a single :const:`file_input` node, and class and |
| 543 | function definitions each contain exactly one :const:`suite` node. Classes and |
| 544 | functions are readily identified as subtrees of code block nodes which start |
| 545 | with ``(stmt, (compound_stmt, (classdef, ...`` or ``(stmt, (compound_stmt, |
| 546 | (funcdef, ...``. Note that these subtrees cannot be matched by :func:`match` |
| 547 | since it does not support multiple sibling nodes to match without regard to |
| 548 | number. A more elaborate matching function could be used to overcome this |
| 549 | limitation, but this is sufficient for the example. |
| 550 | |
| 551 | Given the ability to determine whether a statement might be a docstring and |
| 552 | extract the actual string from the statement, some work needs to be performed to |
| 553 | walk the parse tree for an entire module and extract information about the names |
| 554 | defined in each context of the module and associate any docstrings with the |
| 555 | names. The code to perform this work is not complicated, but bears some |
| 556 | explanation. |
| 557 | |
| 558 | The public interface to the classes is straightforward and should probably be |
| 559 | somewhat more flexible. Each "major" block of the module is described by an |
| 560 | object providing several methods for inquiry and a constructor which accepts at |
| 561 | least the subtree of the complete parse tree which it represents. The |
| 562 | :class:`ModuleInfo` constructor accepts an optional *name* parameter since it |
| 563 | cannot otherwise determine the name of the module. |
| 564 | |
| 565 | The public classes include :class:`ClassInfo`, :class:`FunctionInfo`, and |
| 566 | :class:`ModuleInfo`. All objects provide the methods :meth:`get_name`, |
| 567 | :meth:`get_docstring`, :meth:`get_class_names`, and :meth:`get_class_info`. The |
| 568 | :class:`ClassInfo` objects support :meth:`get_method_names` and |
| 569 | :meth:`get_method_info` while the other classes provide |
| 570 | :meth:`get_function_names` and :meth:`get_function_info`. |
| 571 | |
| 572 | Within each of the forms of code block that the public classes represent, most |
| 573 | of the required information is in the same form and is accessed in the same way, |
| 574 | with classes having the distinction that functions defined at the top level are |
| 575 | referred to as "methods." Since the difference in nomenclature reflects a real |
| 576 | semantic distinction from functions defined outside of a class, the |
| 577 | implementation needs to maintain the distinction. Hence, most of the |
| 578 | functionality of the public classes can be implemented in a common base class, |
| 579 | :class:`SuiteInfoBase`, with the accessors for function and method information |
| 580 | provided elsewhere. Note that there is only one class which represents function |
| 581 | and method information; this parallels the use of the :keyword:`def` statement |
| 582 | to define both types of elements. |
| 583 | |
| 584 | Most of the accessor functions are declared in :class:`SuiteInfoBase` and do not |
| 585 | need to be overridden by subclasses. More importantly, the extraction of most |
| 586 | information from a parse tree is handled through a method called by the |
| 587 | :class:`SuiteInfoBase` constructor. The example code for most of the classes is |
| 588 | clear when read alongside the formal grammar, but the method which recursively |
| 589 | creates new information objects requires further examination. Here is the |
| 590 | relevant part of the :class:`SuiteInfoBase` definition from :file:`example.py`:: |
| 591 | |
| 592 | class SuiteInfoBase: |
| 593 | _docstring = '' |
| 594 | _name = '' |
| 595 | |
| 596 | def __init__(self, tree = None): |
| 597 | self._class_info = {} |
| 598 | self._function_info = {} |
| 599 | if tree: |
| 600 | self._extract_info(tree) |
| 601 | |
| 602 | def _extract_info(self, tree): |
| 603 | # extract docstring |
| 604 | if len(tree) == 2: |
| 605 | found, vars = match(DOCSTRING_STMT_PATTERN[1], tree[1]) |
| 606 | else: |
| 607 | found, vars = match(DOCSTRING_STMT_PATTERN, tree[3]) |
| 608 | if found: |
| 609 | self._docstring = eval(vars['docstring']) |
| 610 | # discover inner definitions |
| 611 | for node in tree[1:]: |
| 612 | found, vars = match(COMPOUND_STMT_PATTERN, node) |
| 613 | if found: |
| 614 | cstmt = vars['compound'] |
| 615 | if cstmt[0] == symbol.funcdef: |
| 616 | name = cstmt[2][1] |
| 617 | self._function_info[name] = FunctionInfo(cstmt) |
| 618 | elif cstmt[0] == symbol.classdef: |
| 619 | name = cstmt[2][1] |
| 620 | self._class_info[name] = ClassInfo(cstmt) |
| 621 | |
| 622 | After initializing some internal state, the constructor calls the |
| 623 | :meth:`_extract_info` method. This method performs the bulk of the information |
| 624 | extraction which takes place in the entire example. The extraction has two |
| 625 | distinct phases: the location of the docstring for the parse tree passed in, and |
| 626 | the discovery of additional definitions within the code block represented by the |
| 627 | parse tree. |
| 628 | |
| 629 | The initial :keyword:`if` test determines whether the nested suite is of the |
| 630 | "short form" or the "long form." The short form is used when the code block is |
| 631 | on the same line as the definition of the code block, as in :: |
| 632 | |
| 633 | def square(x): "Square an argument."; return x ** 2 |
| 634 | |
| 635 | while the long form uses an indented block and allows nested definitions:: |
| 636 | |
| 637 | def make_power(exp): |
Georg Brandl | 1f01deb | 2009-01-03 22:47:39 +0000 | [diff] [blame] | 638 | "Make a function that raises an argument to the exponent `exp`." |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 639 | def raiser(x, y=exp): |
| 640 | return x ** y |
| 641 | return raiser |
| 642 | |
| 643 | When the short form is used, the code block may contain a docstring as the |
| 644 | first, and possibly only, :const:`small_stmt` element. The extraction of such a |
| 645 | docstring is slightly different and requires only a portion of the complete |
| 646 | pattern used in the more common case. As implemented, the docstring will only |
| 647 | be found if there is only one :const:`small_stmt` node in the |
| 648 | :const:`simple_stmt` node. Since most functions and methods which use the short |
| 649 | form do not provide a docstring, this may be considered sufficient. The |
| 650 | extraction of the docstring proceeds using the :func:`match` function as |
| 651 | described above, and the value of the docstring is stored as an attribute of the |
| 652 | :class:`SuiteInfoBase` object. |
| 653 | |
| 654 | After docstring extraction, a simple definition discovery algorithm operates on |
| 655 | the :const:`stmt` nodes of the :const:`suite` node. The special case of the |
| 656 | short form is not tested; since there are no :const:`stmt` nodes in the short |
| 657 | form, the algorithm will silently skip the single :const:`simple_stmt` node and |
| 658 | correctly not discover any nested definitions. |
| 659 | |
| 660 | Each statement in the code block is categorized as a class definition, function |
| 661 | or method definition, or something else. For the definition statements, the |
| 662 | name of the element defined is extracted and a representation object appropriate |
| 663 | to the definition is created with the defining subtree passed as an argument to |
| 664 | the constructor. The representation objects are stored in instance variables |
| 665 | and may be retrieved by name using the appropriate accessor methods. |
| 666 | |
| 667 | The public classes provide any accessors required which are more specific than |
| 668 | those provided by the :class:`SuiteInfoBase` class, but the real extraction |
| 669 | algorithm remains common to all forms of code blocks. A high-level function can |
| 670 | be used to extract the complete set of information from a source file. (See |
| 671 | file :file:`example.py`.) :: |
| 672 | |
| 673 | def get_docs(fileName): |
| 674 | import os |
| 675 | import parser |
| 676 | |
| 677 | source = open(fileName).read() |
| 678 | basename = os.path.basename(os.path.splitext(fileName)[0]) |
Georg Brandl | 0c77a82 | 2008-06-10 16:37:50 +0000 | [diff] [blame] | 679 | st = parser.suite(source) |
| 680 | return ModuleInfo(st.totuple(), basename) |
Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 681 | |
| 682 | This provides an easy-to-use interface to the documentation of a module. If |
| 683 | information is required which is not extracted by the code of this example, the |
| 684 | code may be extended at clearly defined points to provide additional |
| 685 | capabilities. |
| 686 | |