Fred Drake | e2f9917 | 2001-09-27 20:06:07 +0000 | [diff] [blame] | 1 | \chapter{Python compiler package \label{compiler}} |
| 2 | |
| 3 | \sectionauthor{Jeremy Hylton}{jeremy@zope.com} |
| 4 | |
| 5 | |
| 6 | The Python compiler package is a tool for analyzing Python source code |
| 7 | and generating Python bytecode. The compiler contains libraries to |
| 8 | generate an abstract syntax tree from Python source code and to |
| 9 | generate Python bytecode from the tree. |
| 10 | |
| 11 | The \refmodule{compiler} package is a Python source to bytecode |
| 12 | translator written in Python. It uses the built-in parser and |
| 13 | standard \refmodule{parser} module to generated a concrete syntax |
| 14 | tree. This tree is used to generate an abstract syntax tree (AST) and |
| 15 | then Python bytecode. |
| 16 | |
| 17 | The full functionality of the package duplicates the builtin compiler |
| 18 | provided with the Python interpreter. It is intended to match its |
| 19 | behavior almost exactly. Why implement another compiler that does the |
| 20 | same thing? The package is useful for a variety of purposes. It can |
| 21 | be modified more easily than the builtin compiler. The AST it |
| 22 | generates is useful for analyzing Python source code. |
| 23 | |
| 24 | This chapter explains how the various components of the |
| 25 | \refmodule{compiler} package work. It blends reference material with |
| 26 | a tutorial. |
| 27 | |
| 28 | The following modules are part of the \refmodule{compiler} package: |
| 29 | |
| 30 | \localmoduletable |
| 31 | |
| 32 | |
Fred Drake | 67bd683 | 2001-11-02 19:41:23 +0000 | [diff] [blame] | 33 | \section{The basic interface} |
Fred Drake | e2f9917 | 2001-09-27 20:06:07 +0000 | [diff] [blame] | 34 | |
| 35 | \declaremodule{}{compiler} |
| 36 | |
| 37 | The top-level of the package defines four functions. If you import |
| 38 | \module{compiler}, you will get these functions and a collection of |
| 39 | modules contained in the package. |
| 40 | |
| 41 | \begin{funcdesc}{parse}{buf} |
| 42 | Returns an abstract syntax tree for the Python source code in \var{buf}. |
| 43 | The function raises SyntaxError if there is an error in the source |
| 44 | code. The return value is a \class{compiler.ast.Module} instance that |
| 45 | contains the tree. |
| 46 | \end{funcdesc} |
| 47 | |
| 48 | \begin{funcdesc}{parseFile}{path} |
| 49 | Return an abstract syntax tree for the Python source code in the file |
| 50 | specified by \var{path}. It is equivalent to |
| 51 | \code{parse(open(\var{path}).read())}. |
| 52 | \end{funcdesc} |
| 53 | |
| 54 | \begin{funcdesc}{walk}{ast, visitor\optional{, verbose}} |
| 55 | Do a pre-order walk over the abstract syntax tree \var{ast}. Call the |
| 56 | appropriate method on the \var{visitor} instance for each node |
| 57 | encountered. |
| 58 | \end{funcdesc} |
| 59 | |
Jeremy Hylton | db5a93c | 2001-12-04 02:48:52 +0000 | [diff] [blame] | 60 | \begin{funcdesc}{compile}{source, filename, mode, flags=None, |
| 61 | dont_inherit=None} |
| 62 | Compile the string \var{source}, a Python module, statement or |
| 63 | expression, into a code object that can be executed by the exec |
| 64 | statement or \function{eval()}. This function is a replacement for the |
| 65 | built-in \function{compile()} function. |
| 66 | |
| 67 | The \var{filename} will be used for run-time error messages. |
| 68 | |
| 69 | The \var{mode} must be 'exec' to compile a module, 'single' to compile a |
| 70 | single (interactive) statement, or 'eval' to compile an expression. |
| 71 | |
| 72 | The \var{flags} and \var{dont_inherit} arguments affect future-related |
| 73 | statements, but are not supported yet. |
| 74 | \end{funcdesc} |
| 75 | |
| 76 | \begin{funcdesc}{compileFile}{source} |
| 77 | Compiles the file \var{source} and generates a .pyc file. |
Fred Drake | e2f9917 | 2001-09-27 20:06:07 +0000 | [diff] [blame] | 78 | \end{funcdesc} |
| 79 | |
| 80 | The \module{compiler} package contains the following modules: |
| 81 | \refmodule[compiler.ast]{ast}, \module{consts}, \module{future}, |
| 82 | \module{misc}, \module{pyassem}, \module{pycodegen}, \module{symbols}, |
| 83 | \module{transformer}, and \refmodule[compiler.visitor]{visitor}. |
| 84 | |
Fred Drake | 67bd683 | 2001-11-02 19:41:23 +0000 | [diff] [blame] | 85 | \section{Limitations} |
Fred Drake | e2f9917 | 2001-09-27 20:06:07 +0000 | [diff] [blame] | 86 | |
| 87 | There are some problems with the error checking of the compiler |
| 88 | package. The interpreter detects syntax errors in two distinct |
| 89 | phases. One set of errors is detected by the interpreter's parser, |
| 90 | the other set by the compiler. The compiler package relies on the |
| 91 | interpreter's parser, so it get the first phases of error checking for |
| 92 | free. It implements the second phase itself, and that implement is |
| 93 | incomplete. For example, the compiler package does not raise an error |
| 94 | if a name appears more than once in an argument list: |
| 95 | \code{def f(x, x): ...} |
| 96 | |
| 97 | A future version of the compiler should fix these problems. |
| 98 | |
| 99 | \section{Python Abstract Syntax} |
| 100 | |
| 101 | The \module{compiler.ast} module defines an abstract syntax for |
| 102 | Python. In the abstract syntax tree, each node represents a syntactic |
| 103 | construct. The root of the tree is \class{Module} object. |
| 104 | |
| 105 | The abstract syntax offers a higher level interface to parsed Python |
| 106 | source code. The \ulink{\module{parser}} |
| 107 | {http://www.python.org/doc/current/lib/module-parser.html} |
| 108 | module and the compiler written in C for the Python interpreter use a |
| 109 | concrete syntax tree. The concrete syntax is tied closely to the |
| 110 | grammar description used for the Python parser. Instead of a single |
| 111 | node for a construct, there are often several levels of nested nodes |
| 112 | that are introduced by Python's precedence rules. |
| 113 | |
| 114 | The abstract syntax tree is created by the |
| 115 | \module{compiler.transformer} module. The transformer relies on the |
| 116 | builtin Python parser to generate a concrete syntax tree. It |
| 117 | generates an abstract syntax tree from the concrete tree. |
| 118 | |
| 119 | The \module{transformer} module was created by Greg |
| 120 | Stein\index{Stein, Greg} and Bill Tutt\index{Tutt, Bill} for an |
| 121 | experimental Python-to-C compiler. The current version contains a |
| 122 | number of modifications and improvements, but the basic form of the |
| 123 | abstract syntax and of the transformer are due to Stein and Tutt. |
| 124 | |
| 125 | \subsection{AST Nodes} |
| 126 | |
| 127 | \declaremodule{}{compiler.ast} |
| 128 | |
| 129 | The \module{compiler.ast} module is generated from a text file that |
| 130 | describes each node type and its elements. Each node type is |
| 131 | represented as a class that inherits from the abstract base class |
| 132 | \class{compiler.ast.Node} and defines a set of named attributes for |
| 133 | child nodes. |
| 134 | |
| 135 | \begin{classdesc}{Node}{} |
| 136 | |
| 137 | The \class{Node} instances are created automatically by the parser |
| 138 | generator. The recommended interface for specific \class{Node} |
| 139 | instances is to use the public attributes to access child nodes. A |
| 140 | public attribute may be bound to a single node or to a sequence of |
| 141 | nodes, depending on the \class{Node} type. For example, the |
| 142 | \member{bases} attribute of the \class{Class} node, is bound to a |
| 143 | list of base class nodes, and the \member{doc} attribute is bound to |
| 144 | a single node. |
| 145 | |
| 146 | Each \class{Node} instance has a \member{lineno} attribute which may |
| 147 | be \code{None}. XXX Not sure what the rules are for which nodes |
| 148 | will have a useful lineno. |
| 149 | \end{classdesc} |
| 150 | |
| 151 | All \class{Node} objects offer the following methods: |
| 152 | |
| 153 | \begin{methoddesc}{getChildren}{} |
| 154 | Returns a flattened list of the child nodes and objects in the |
| 155 | order they occur. Specifically, the order of the nodes is the |
| 156 | order in which they appear in the Python grammar. Not all of the |
| 157 | children are \class{Node} instances. The names of functions and |
| 158 | classes, for example, are plain strings. |
| 159 | \end{methoddesc} |
| 160 | |
| 161 | \begin{methoddesc}{getChildNodes}{} |
| 162 | Returns a flattened list of the child nodes in the order they |
| 163 | occur. This method is like \method{getChildren()}, except that it |
| 164 | only returns those children that are \class{Node} instances. |
| 165 | \end{methoddesc} |
| 166 | |
| 167 | Two examples illustrate the general structure of \class{Node} |
| 168 | classes. The \keyword{while} statement is defined by the following |
| 169 | grammar production: |
| 170 | |
| 171 | \begin{verbatim} |
| 172 | while_stmt: "while" expression ":" suite |
| 173 | ["else" ":" suite] |
| 174 | \end{verbatim} |
| 175 | |
| 176 | The \class{While} node has three attributes: \member{test}, |
| 177 | \member{body}, and \member{else_}. (If the natural name for an |
| 178 | attribute is also a Python reserved word, it can't be used as an |
| 179 | attribute name. An underscore is appended to the word to make it a |
| 180 | legal identifier, hence \member{else_} instead of \keyword{else}.) |
| 181 | |
| 182 | The \keyword{if} statement is more complicated because it can include |
| 183 | several tests. |
| 184 | |
| 185 | \begin{verbatim} |
| 186 | if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite] |
| 187 | \end{verbatim} |
| 188 | |
| 189 | The \class{If} node only defines two attributes: \member{tests} and |
| 190 | \member{else_}. The \member{tests} attribute is a sequence of test |
| 191 | expression, consequent body pairs. There is one pair for each |
| 192 | \keyword{if}/\keyword{elif} clause. The first element of the pair is |
| 193 | the test expression. The second elements is a \class{Stmt} node that |
| 194 | contains the code to execute if the test is true. |
| 195 | |
| 196 | The \method{getChildren()} method of \class{If} returns a flat list of |
| 197 | child nodes. If there are three \keyword{if}/\keyword{elif} clauses |
| 198 | and no \keyword{else} clause, then \method{getChildren()} will return |
| 199 | a list of six elements: the first test expression, the first |
| 200 | \class{Stmt}, the second text expression, etc. |
| 201 | |
| 202 | The following table lists each of the \class{Node} subclasses defined |
| 203 | in \module{compiler.ast} and each of the public attributes available |
| 204 | on their instances. The values of most of the attributes are |
| 205 | themselves \class{Node} instances or sequences of instances. When the |
| 206 | value is something other than an instance, the type is noted in the |
| 207 | comment. The attributes are listed in the order in which they are |
| 208 | returned by \method{getChildren()} and \method{getChildNodes()}. |
| 209 | |
| 210 | \input{asttable} |
| 211 | |
| 212 | |
| 213 | \subsection{Assignment nodes} |
| 214 | |
| 215 | There is a collection of nodes used to represent assignments. Each |
| 216 | assignment statement in the source code becomes a single |
| 217 | \class{Assign} node in the AST. The \member{nodes} attribute is a |
| 218 | list that contains a node for each assignment target. This is |
| 219 | necessary because assignment can be chained, e.g. \code{a = b = 2}. |
| 220 | Each \class{Node} in the list will be one of the following classes: |
| 221 | \class{AssAttr}, \class{AssList}, \class{AssName}, or |
| 222 | \class{AssTuple}. |
| 223 | |
| 224 | Each target assignment node will describe the kind of object being |
| 225 | assigned to: \class{AssName} for a simple name, e.g. \code{a = 1}. |
| 226 | \class{AssAttr} for an attribute assigned, e.g. \code{a.x = 1}. |
| 227 | \class{AssList} and \class{AssTuple} for list and tuple expansion |
| 228 | respectively, e.g. \code{a, b, c = a_tuple}. |
| 229 | |
| 230 | The target assignment nodes also have a \member{flags} attribute that |
| 231 | indicates whether the node is being used for assignment or in a delete |
| 232 | statement. The \class{AssName} is also used to represent a delete |
| 233 | statement, e.g. \class{del x}. |
| 234 | |
| 235 | When an expression contains several attribute references, an |
| 236 | assignment or delete statement will contain only one \class{AssAttr} |
| 237 | node -- for the final attribute reference. The other attribute |
| 238 | references will be represented as \class{Getattr} nodes in the |
| 239 | \member{expr} attribute of the \class{AssAttr} instance. |
| 240 | |
| 241 | \subsection{Examples} |
| 242 | |
| 243 | This section shows several simple examples of ASTs for Python source |
| 244 | code. The examples demonstrate how to use the \function{parse()} |
| 245 | function, what the repr of an AST looks like, and how to access |
| 246 | attributes of an AST node. |
| 247 | |
| 248 | The first module defines a single function. Assume it is stored in |
| 249 | \file{/tmp/doublelib.py}. |
| 250 | |
| 251 | \begin{verbatim} |
| 252 | """This is an example module. |
| 253 | |
| 254 | This is the docstring. |
| 255 | """ |
| 256 | |
| 257 | def double(x): |
| 258 | "Return twice the argument" |
| 259 | return x * 2 |
| 260 | \end{verbatim} |
| 261 | |
| 262 | In the interactive interpreter session below, I have reformatted the |
| 263 | long AST reprs for readability. The AST reprs use unqualified class |
| 264 | names. If you want to create an instance from a repr, you must import |
| 265 | the class names from the \module{compiler.ast} module. |
| 266 | |
| 267 | \begin{verbatim} |
| 268 | >>> import compiler |
| 269 | >>> mod = compiler.parseFile("/tmp/doublelib.py") |
| 270 | >>> mod |
| 271 | Module('This is an example module.\n\nThis is the docstring.\n', |
| 272 | Stmt([Function('double', ['x'], [], 0, 'Return twice the argument', |
| 273 | Stmt([Return(Mul((Name('x'), Const(2))))]))])) |
| 274 | >>> from compiler.ast import * |
| 275 | >>> Module('This is an example module.\n\nThis is the docstring.\n', |
| 276 | ... Stmt([Function('double', ['x'], [], 0, 'Return twice the argument', |
| 277 | ... Stmt([Return(Mul((Name('x'), Const(2))))]))])) |
| 278 | Module('This is an example module.\n\nThis is the docstring.\n', |
| 279 | Stmt([Function('double', ['x'], [], 0, 'Return twice the argument', |
| 280 | Stmt([Return(Mul((Name('x'), Const(2))))]))])) |
| 281 | >>> mod.doc |
| 282 | 'This is an example module.\n\nThis is the docstring.\n' |
| 283 | >>> for node in mod.node.nodes: |
| 284 | ... print node |
| 285 | ... |
| 286 | Function('double', ['x'], [], 0, 'Return twice the argument', |
| 287 | Stmt([Return(Mul((Name('x'), Const(2))))])) |
| 288 | >>> func = mod.node.nodes[0] |
| 289 | >>> func.code |
| 290 | Stmt([Return(Mul((Name('x'), Const(2))))]) |
| 291 | \end{verbatim} |
| 292 | |
| 293 | \section{Using Visitors to Walk ASTs} |
| 294 | |
| 295 | \declaremodule{}{compiler.visitor} |
| 296 | |
| 297 | The visitor pattern is ... The \refmodule{compiler} package uses a |
| 298 | variant on the visitor pattern that takes advantage of Python's |
| 299 | introspection features to elminiate the need for much of the visitor's |
| 300 | infrastructure. |
| 301 | |
| 302 | The classes being visited do not need to be programmed to accept |
| 303 | visitors. The visitor need only define visit methods for classes it |
| 304 | is specifically interested in; a default visit method can handle the |
| 305 | rest. |
| 306 | |
| 307 | XXX The magic \method{visit()} method for visitors. |
| 308 | |
| 309 | \begin{funcdesc}{walk}{tree, visitor\optional{, verbose}} |
| 310 | \end{funcdesc} |
| 311 | |
| 312 | \begin{classdesc}{ASTVisitor}{} |
| 313 | |
| 314 | The \class{ASTVisitor} is responsible for walking over the tree in the |
| 315 | correct order. A walk begins with a call to \method{preorder()}. For |
| 316 | each node, it checks the \var{visitor} argument to \method{preorder()} |
| 317 | for a method named `visitNodeType,' where NodeType is the name of the |
| 318 | node's class, e.g. for a \class{While} node a \method{visitWhile()} |
| 319 | would be called. If the method exists, it is called with the node as |
| 320 | its first argument. |
| 321 | |
| 322 | The visitor method for a particular node type can control how child |
| 323 | nodes are visited during the walk. The \class{ASTVisitor} modifies |
| 324 | the visitor argument by adding a visit method to the visitor; this |
| 325 | method can be used to visit a particular child node. If no visitor is |
| 326 | found for a particular node type, the \method{default()} method is |
| 327 | called. |
| 328 | \end{classdesc} |
| 329 | |
| 330 | \class{ASTVisitor} objects have the following methods: |
| 331 | |
| 332 | XXX describe extra arguments |
| 333 | |
| 334 | \begin{methoddesc}{default}{node\optional{, \moreargs}} |
| 335 | \end{methoddesc} |
| 336 | |
| 337 | \begin{methoddesc}{dispatch}{node\optional{, \moreargs}} |
| 338 | \end{methoddesc} |
| 339 | |
| 340 | \begin{methoddesc}{preorder}{tree, visitor} |
| 341 | \end{methoddesc} |
| 342 | |
| 343 | |
| 344 | \section{Bytecode Generation} |
| 345 | |
| 346 | The code generator is a visitor that emits bytecodes. Each visit method |
| 347 | can call the \method{emit()} method to emit a new bytecode. The basic |
| 348 | code generator is specialized for modules, classes, and functions. An |
| 349 | assembler converts that emitted instructions to the low-level bytecode |
| 350 | format. It handles things like generator of constant lists of code |
| 351 | objects and calculation of jump offsets. |