| \chapter{Python compiler package \label{compiler}} |
| |
| \sectionauthor{Jeremy Hylton}{jeremy@zope.com} |
| |
| |
| The Python compiler package is a tool for analyzing Python source code |
| and generating Python bytecode. The compiler contains libraries to |
| generate an abstract syntax tree from Python source code and to |
| generate Python bytecode from the tree. |
| |
| The \refmodule{compiler} package is a Python source to bytecode |
| translator written in Python. It uses the built-in parser and |
| standard \refmodule{parser} module to generated a concrete syntax |
| tree. This tree is used to generate an abstract syntax tree (AST) and |
| then Python bytecode. |
| |
| The full functionality of the package duplicates the builtin compiler |
| provided with the Python interpreter. It is intended to match its |
| behavior almost exactly. Why implement another compiler that does the |
| same thing? The package is useful for a variety of purposes. It can |
| be modified more easily than the builtin compiler. The AST it |
| generates is useful for analyzing Python source code. |
| |
| This chapter explains how the various components of the |
| \refmodule{compiler} package work. It blends reference material with |
| a tutorial. |
| |
| The following modules are part of the \refmodule{compiler} package: |
| |
| \localmoduletable |
| |
| |
| \section{The basic interface} |
| |
| \declaremodule{}{compiler} |
| |
| The top-level of the package defines four functions. If you import |
| \module{compiler}, you will get these functions and a collection of |
| modules contained in the package. |
| |
| \begin{funcdesc}{parse}{buf} |
| Returns an abstract syntax tree for the Python source code in \var{buf}. |
| The function raises SyntaxError if there is an error in the source |
| code. The return value is a \class{compiler.ast.Module} instance that |
| contains the tree. |
| \end{funcdesc} |
| |
| \begin{funcdesc}{parseFile}{path} |
| Return an abstract syntax tree for the Python source code in the file |
| specified by \var{path}. It is equivalent to |
| \code{parse(open(\var{path}).read())}. |
| \end{funcdesc} |
| |
| \begin{funcdesc}{walk}{ast, visitor\optional{, verbose}} |
| Do a pre-order walk over the abstract syntax tree \var{ast}. Call the |
| appropriate method on the \var{visitor} instance for each node |
| encountered. |
| \end{funcdesc} |
| |
| \begin{funcdesc}{compile}{source, filename, mode, flags=None, |
| dont_inherit=None} |
| Compile the string \var{source}, a Python module, statement or |
| expression, into a code object that can be executed by the exec |
| statement or \function{eval()}. This function is a replacement for the |
| built-in \function{compile()} function. |
| |
| The \var{filename} will be used for run-time error messages. |
| |
| The \var{mode} must be 'exec' to compile a module, 'single' to compile a |
| single (interactive) statement, or 'eval' to compile an expression. |
| |
| The \var{flags} and \var{dont_inherit} arguments affect future-related |
| statements, but are not supported yet. |
| \end{funcdesc} |
| |
| \begin{funcdesc}{compileFile}{source} |
| Compiles the file \var{source} and generates a .pyc file. |
| \end{funcdesc} |
| |
| The \module{compiler} package contains the following modules: |
| \refmodule[compiler.ast]{ast}, \module{consts}, \module{future}, |
| \module{misc}, \module{pyassem}, \module{pycodegen}, \module{symbols}, |
| \module{transformer}, and \refmodule[compiler.visitor]{visitor}. |
| |
| \section{Limitations} |
| |
| There are some problems with the error checking of the compiler |
| package. The interpreter detects syntax errors in two distinct |
| phases. One set of errors is detected by the interpreter's parser, |
| the other set by the compiler. The compiler package relies on the |
| interpreter's parser, so it get the first phases of error checking for |
| free. It implements the second phase itself, and that implement is |
| incomplete. For example, the compiler package does not raise an error |
| if a name appears more than once in an argument list: |
| \code{def f(x, x): ...} |
| |
| A future version of the compiler should fix these problems. |
| |
| \section{Python Abstract Syntax} |
| |
| The \module{compiler.ast} module defines an abstract syntax for |
| Python. In the abstract syntax tree, each node represents a syntactic |
| construct. The root of the tree is \class{Module} object. |
| |
| The abstract syntax offers a higher level interface to parsed Python |
| source code. The \ulink{\module{parser}} |
| {http://www.python.org/doc/current/lib/module-parser.html} |
| module and the compiler written in C for the Python interpreter use a |
| concrete syntax tree. The concrete syntax is tied closely to the |
| grammar description used for the Python parser. Instead of a single |
| node for a construct, there are often several levels of nested nodes |
| that are introduced by Python's precedence rules. |
| |
| The abstract syntax tree is created by the |
| \module{compiler.transformer} module. The transformer relies on the |
| builtin Python parser to generate a concrete syntax tree. It |
| generates an abstract syntax tree from the concrete tree. |
| |
| The \module{transformer} module was created by Greg |
| Stein\index{Stein, Greg} and Bill Tutt\index{Tutt, Bill} for an |
| experimental Python-to-C compiler. The current version contains a |
| number of modifications and improvements, but the basic form of the |
| abstract syntax and of the transformer are due to Stein and Tutt. |
| |
| \subsection{AST Nodes} |
| |
| \declaremodule{}{compiler.ast} |
| |
| The \module{compiler.ast} module is generated from a text file that |
| describes each node type and its elements. Each node type is |
| represented as a class that inherits from the abstract base class |
| \class{compiler.ast.Node} and defines a set of named attributes for |
| child nodes. |
| |
| \begin{classdesc}{Node}{} |
| |
| The \class{Node} instances are created automatically by the parser |
| generator. The recommended interface for specific \class{Node} |
| instances is to use the public attributes to access child nodes. A |
| public attribute may be bound to a single node or to a sequence of |
| nodes, depending on the \class{Node} type. For example, the |
| \member{bases} attribute of the \class{Class} node, is bound to a |
| list of base class nodes, and the \member{doc} attribute is bound to |
| a single node. |
| |
| Each \class{Node} instance has a \member{lineno} attribute which may |
| be \code{None}. XXX Not sure what the rules are for which nodes |
| will have a useful lineno. |
| \end{classdesc} |
| |
| All \class{Node} objects offer the following methods: |
| |
| \begin{methoddesc}{getChildren}{} |
| Returns a flattened list of the child nodes and objects in the |
| order they occur. Specifically, the order of the nodes is the |
| order in which they appear in the Python grammar. Not all of the |
| children are \class{Node} instances. The names of functions and |
| classes, for example, are plain strings. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{getChildNodes}{} |
| Returns a flattened list of the child nodes in the order they |
| occur. This method is like \method{getChildren()}, except that it |
| only returns those children that are \class{Node} instances. |
| \end{methoddesc} |
| |
| Two examples illustrate the general structure of \class{Node} |
| classes. The \keyword{while} statement is defined by the following |
| grammar production: |
| |
| \begin{verbatim} |
| while_stmt: "while" expression ":" suite |
| ["else" ":" suite] |
| \end{verbatim} |
| |
| The \class{While} node has three attributes: \member{test}, |
| \member{body}, and \member{else_}. (If the natural name for an |
| attribute is also a Python reserved word, it can't be used as an |
| attribute name. An underscore is appended to the word to make it a |
| legal identifier, hence \member{else_} instead of \keyword{else}.) |
| |
| The \keyword{if} statement is more complicated because it can include |
| several tests. |
| |
| \begin{verbatim} |
| if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite] |
| \end{verbatim} |
| |
| The \class{If} node only defines two attributes: \member{tests} and |
| \member{else_}. The \member{tests} attribute is a sequence of test |
| expression, consequent body pairs. There is one pair for each |
| \keyword{if}/\keyword{elif} clause. The first element of the pair is |
| the test expression. The second elements is a \class{Stmt} node that |
| contains the code to execute if the test is true. |
| |
| The \method{getChildren()} method of \class{If} returns a flat list of |
| child nodes. If there are three \keyword{if}/\keyword{elif} clauses |
| and no \keyword{else} clause, then \method{getChildren()} will return |
| a list of six elements: the first test expression, the first |
| \class{Stmt}, the second text expression, etc. |
| |
| The following table lists each of the \class{Node} subclasses defined |
| in \module{compiler.ast} and each of the public attributes available |
| on their instances. The values of most of the attributes are |
| themselves \class{Node} instances or sequences of instances. When the |
| value is something other than an instance, the type is noted in the |
| comment. The attributes are listed in the order in which they are |
| returned by \method{getChildren()} and \method{getChildNodes()}. |
| |
| \input{asttable} |
| |
| |
| \subsection{Assignment nodes} |
| |
| There is a collection of nodes used to represent assignments. Each |
| assignment statement in the source code becomes a single |
| \class{Assign} node in the AST. The \member{nodes} attribute is a |
| list that contains a node for each assignment target. This is |
| necessary because assignment can be chained, e.g. \code{a = b = 2}. |
| Each \class{Node} in the list will be one of the following classes: |
| \class{AssAttr}, \class{AssList}, \class{AssName}, or |
| \class{AssTuple}. |
| |
| Each target assignment node will describe the kind of object being |
| assigned to: \class{AssName} for a simple name, e.g. \code{a = 1}. |
| \class{AssAttr} for an attribute assigned, e.g. \code{a.x = 1}. |
| \class{AssList} and \class{AssTuple} for list and tuple expansion |
| respectively, e.g. \code{a, b, c = a_tuple}. |
| |
| The target assignment nodes also have a \member{flags} attribute that |
| indicates whether the node is being used for assignment or in a delete |
| statement. The \class{AssName} is also used to represent a delete |
| statement, e.g. \class{del x}. |
| |
| When an expression contains several attribute references, an |
| assignment or delete statement will contain only one \class{AssAttr} |
| node -- for the final attribute reference. The other attribute |
| references will be represented as \class{Getattr} nodes in the |
| \member{expr} attribute of the \class{AssAttr} instance. |
| |
| \subsection{Examples} |
| |
| This section shows several simple examples of ASTs for Python source |
| code. The examples demonstrate how to use the \function{parse()} |
| function, what the repr of an AST looks like, and how to access |
| attributes of an AST node. |
| |
| The first module defines a single function. Assume it is stored in |
| \file{/tmp/doublelib.py}. |
| |
| \begin{verbatim} |
| """This is an example module. |
| |
| This is the docstring. |
| """ |
| |
| def double(x): |
| "Return twice the argument" |
| return x * 2 |
| \end{verbatim} |
| |
| In the interactive interpreter session below, I have reformatted the |
| long AST reprs for readability. The AST reprs use unqualified class |
| names. If you want to create an instance from a repr, you must import |
| the class names from the \module{compiler.ast} module. |
| |
| \begin{verbatim} |
| >>> import compiler |
| >>> mod = compiler.parseFile("/tmp/doublelib.py") |
| >>> mod |
| Module('This is an example module.\n\nThis is the docstring.\n', |
| Stmt([Function('double', ['x'], [], 0, 'Return twice the argument', |
| Stmt([Return(Mul((Name('x'), Const(2))))]))])) |
| >>> from compiler.ast import * |
| >>> Module('This is an example module.\n\nThis is the docstring.\n', |
| ... Stmt([Function('double', ['x'], [], 0, 'Return twice the argument', |
| ... Stmt([Return(Mul((Name('x'), Const(2))))]))])) |
| Module('This is an example module.\n\nThis is the docstring.\n', |
| Stmt([Function('double', ['x'], [], 0, 'Return twice the argument', |
| Stmt([Return(Mul((Name('x'), Const(2))))]))])) |
| >>> mod.doc |
| 'This is an example module.\n\nThis is the docstring.\n' |
| >>> for node in mod.node.nodes: |
| ... print node |
| ... |
| Function('double', ['x'], [], 0, 'Return twice the argument', |
| Stmt([Return(Mul((Name('x'), Const(2))))])) |
| >>> func = mod.node.nodes[0] |
| >>> func.code |
| Stmt([Return(Mul((Name('x'), Const(2))))]) |
| \end{verbatim} |
| |
| \section{Using Visitors to Walk ASTs} |
| |
| \declaremodule{}{compiler.visitor} |
| |
| The visitor pattern is ... The \refmodule{compiler} package uses a |
| variant on the visitor pattern that takes advantage of Python's |
| introspection features to elminiate the need for much of the visitor's |
| infrastructure. |
| |
| The classes being visited do not need to be programmed to accept |
| visitors. The visitor need only define visit methods for classes it |
| is specifically interested in; a default visit method can handle the |
| rest. |
| |
| XXX The magic \method{visit()} method for visitors. |
| |
| \begin{funcdesc}{walk}{tree, visitor\optional{, verbose}} |
| \end{funcdesc} |
| |
| \begin{classdesc}{ASTVisitor}{} |
| |
| The \class{ASTVisitor} is responsible for walking over the tree in the |
| correct order. A walk begins with a call to \method{preorder()}. For |
| each node, it checks the \var{visitor} argument to \method{preorder()} |
| for a method named `visitNodeType,' where NodeType is the name of the |
| node's class, e.g. for a \class{While} node a \method{visitWhile()} |
| would be called. If the method exists, it is called with the node as |
| its first argument. |
| |
| The visitor method for a particular node type can control how child |
| nodes are visited during the walk. The \class{ASTVisitor} modifies |
| the visitor argument by adding a visit method to the visitor; this |
| method can be used to visit a particular child node. If no visitor is |
| found for a particular node type, the \method{default()} method is |
| called. |
| \end{classdesc} |
| |
| \class{ASTVisitor} objects have the following methods: |
| |
| XXX describe extra arguments |
| |
| \begin{methoddesc}{default}{node\optional{, \moreargs}} |
| \end{methoddesc} |
| |
| \begin{methoddesc}{dispatch}{node\optional{, \moreargs}} |
| \end{methoddesc} |
| |
| \begin{methoddesc}{preorder}{tree, visitor} |
| \end{methoddesc} |
| |
| |
| \section{Bytecode Generation} |
| |
| The code generator is a visitor that emits bytecodes. Each visit method |
| can call the \method{emit()} method to emit a new bytecode. The basic |
| code generator is specialized for modules, classes, and functions. An |
| assembler converts that emitted instructions to the low-level bytecode |
| format. It handles things like generator of constant lists of code |
| objects and calculation of jump offsets. |