| \chapter{Python compiler package \label{compiler}} | 
 |  | 
 | \sectionauthor{Jeremy Hylton}{jeremy@zope.com} | 
 |  | 
 |  | 
 | The Python compiler package is a tool for analyzing Python source code | 
 | and generating Python bytecode.  The compiler contains libraries to | 
 | generate an abstract syntax tree from Python source code and to | 
 | generate Python bytecode from the tree. | 
 |  | 
 | The \refmodule{compiler} package is a Python source to bytecode | 
 | translator written in Python.  It uses the built-in parser and | 
 | standard \refmodule{parser} module to generate a concrete syntax | 
 | tree.  This tree is used to generate an abstract syntax tree (AST) and | 
 | then Python bytecode. | 
 |  | 
 | The full functionality of the package duplicates the built-in compiler | 
 | provided with the Python interpreter.  It is intended to match its | 
 | behavior almost exactly.  Why implement another compiler that does the | 
 | same thing?  The package is useful for a variety of purposes.  It can | 
 | be modified more easily than the built-in compiler.  The AST it | 
 | generates is useful for analyzing Python source code. | 
 |  | 
 | This chapter explains how the various components of the | 
 | \refmodule{compiler} package work.  It blends reference material with | 
 | a tutorial. | 
 |  | 
 | The following modules are part of the \refmodule{compiler} package: | 
 |  | 
 | \localmoduletable | 
 |  | 
 |  | 
 | \section{The basic interface} | 
 |  | 
 | \declaremodule{}{compiler} | 
 |  | 
 | The top level of the package defines five functions.  If you import | 
 | \module{compiler}, you will get these functions and a collection of | 
 | modules contained in the package. | 
 |  | 
 | \begin{funcdesc}{parse}{buf} | 
 | Return an abstract syntax tree for the Python source code in \var{buf}. | 
 | The function raises \exception{SyntaxError} if there is an error in the | 
 | source code.  The return value is a \class{compiler.ast.Module} instance | 
 | that contains the tree. | 
 | \end{funcdesc} | 
 |  | 
 | \begin{funcdesc}{parseFile}{path} | 
 | Return an abstract syntax tree for the Python source code in the file | 
 | specified by \var{path}.  It is equivalent to | 
 | \code{parse(open(\var{path}).read())}. | 
 | \end{funcdesc} | 
 |  | 
 | \begin{funcdesc}{walk}{ast, visitor\optional{, verbose}} | 
 | Do a pre-order walk over the abstract syntax tree \var{ast}.  Call the | 
 | appropriate method on the \var{visitor} instance for each node | 
 | encountered. | 
 | \end{funcdesc} | 
 |  | 
 | \begin{funcdesc}{compile}{source, filename, mode, flags=None,  | 
 | 			dont_inherit=None} | 
 | Compile the string \var{source}, a Python module, statement or | 
 | expression, into a code object that can be executed by the | 
 | \keyword{exec} statement or \function{eval()}.  This function is a | 
 | replacement for the built-in \function{compile()} function. | 
 |  | 
 | The \var{filename} will be used for run-time error messages. | 
 |  | 
 | The \var{mode} must be \code{'exec'} to compile a module, \code{'single'} | 
 | to compile a single (interactive) statement, or \code{'eval'} to compile | 
 | an expression. | 
 |  | 
 | The \var{flags} and \var{dont_inherit} arguments affect future-related | 
 | statements, but are not supported yet. | 
 | \end{funcdesc} | 
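
The three \var{mode} values can be illustrated with the built-in
\function{compile()} function, which the function above is documented to
replace.  The following is a minimal sketch that uses only the built-in; the
\code{"<example>"} filename and the sample source strings are illustrative
assumptions, and the sketch has not been checked against
\function{compiler.compile()} itself.

```python
# Sketch using the built-in compile(); compiler.compile() is documented
# as a replacement that takes the same source, filename and mode arguments.
namespace = {}

# 'exec' compiles a whole module (a sequence of statements).
module_code = compile("x = 6\ny = x * 7\n", "<example>", "exec")
exec(module_code, namespace)

# 'eval' compiles a single expression; eval() returns its value.
expr_code = compile("x + y", "<example>", "eval")
result = eval(expr_code, namespace)

# 'single' compiles one interactive statement.
single_code = compile("z = x - 1\n", "<example>", "single")
exec(single_code, namespace)
```

In each case the returned code object is run in \code{namespace}, so names
bound by the \code{'exec'} code are visible to the later \code{'eval'} and
\code{'single'} code.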
 |  | 
 | \begin{funcdesc}{compileFile}{source} | 
 | Compile the file \var{source} and generate a \file{.pyc} file. | 
 | \end{funcdesc} | 
 |  | 
 | The \module{compiler} package contains the following modules: | 
 | \refmodule[compiler.ast]{ast}, \module{consts}, \module{future}, | 
 | \module{misc}, \module{pyassem}, \module{pycodegen}, \module{symbols}, | 
 | \module{transformer}, and \refmodule[compiler.visitor]{visitor}. | 
 |  | 
 | \section{Limitations} | 
 |  | 
 | There are some problems with the error checking of the compiler | 
 | package.  The interpreter detects syntax errors in two distinct | 
 | phases.  One set of errors is detected by the interpreter's parser, | 
 | the other set by the compiler.  The compiler package relies on the | 
 | interpreter's parser, so it gets the first phase of error checking for | 
 | free.  It implements the second phase itself, and that implementation is | 
 | incomplete.  For example, the compiler package does not raise an error | 
 | if a name appears more than once in an argument list:  | 
 | \code{def f(x, x): ...} | 
 |  | 
 | A future version of the compiler should fix these problems. | 
 |  | 
 | \section{Python Abstract Syntax} | 
 |  | 
 | The \module{compiler.ast} module defines an abstract syntax for | 
 | Python.  In the abstract syntax tree, each node represents a syntactic | 
 | construct.  The root of the tree is a \class{Module} object. | 
 |  | 
 | The abstract syntax offers a higher-level interface to parsed Python | 
 | source code.  The \ulink{\module{parser}} | 
 | {http://www.python.org/doc/current/lib/module-parser.html} | 
 | module and the compiler written in C for the Python interpreter use a | 
 | concrete syntax tree.  The concrete syntax is tied closely to the | 
 | grammar description used for the Python parser.  Instead of a single | 
 | node for a construct, there are often several levels of nested nodes | 
 | that are introduced by Python's precedence rules. | 
 |  | 
 | The abstract syntax tree is created by the | 
 | \module{compiler.transformer} module.  The transformer relies on the | 
 | built-in Python parser to generate a concrete syntax tree.  It | 
 | generates an abstract syntax tree from the concrete tree.   | 
 |  | 
 | The \module{transformer} module was created by Greg | 
 | Stein\index{Stein, Greg} and Bill Tutt\index{Tutt, Bill} for an | 
 | experimental Python-to-C compiler.  The current version contains a | 
 | number of modifications and improvements, but the basic form of the | 
 | abstract syntax and of the transformer is due to Stein and Tutt. | 
 |  | 
 | \subsection{AST Nodes} | 
 |  | 
 | \declaremodule{}{compiler.ast} | 
 |  | 
 | The \module{compiler.ast} module is generated from a text file that | 
 | describes each node type and its elements.  Each node type is | 
 | represented as a class that inherits from the abstract base class | 
 | \class{compiler.ast.Node} and defines a set of named attributes for | 
 | child nodes. | 
 |  | 
 | \begin{classdesc}{Node}{} | 
 |    | 
 |   The \class{Node} instances are created automatically by the parser | 
 |   generator.  The recommended interface for specific \class{Node} | 
 |   instances is to use the public attributes to access child nodes.  A | 
 |   public attribute may be bound to a single node or to a sequence of | 
 |   nodes, depending on the \class{Node} type.  For example, the | 
 |   \member{bases} attribute of the \class{Class} node is bound to a | 
 |   list of base class nodes, and the \member{doc} attribute is bound to | 
 |   a single node. | 
 |    | 
 |   Each \class{Node} instance has a \member{lineno} attribute which may | 
 |   be \code{None}.  XXX Not sure what the rules are for which nodes | 
 |   will have a useful lineno. | 
 | \end{classdesc} | 
 |  | 
 | All \class{Node} objects offer the following methods: | 
 |  | 
 | \begin{methoddesc}{getChildren}{} | 
 |   Returns a flattened list of the child nodes and objects in the | 
 |   order they occur.  Specifically, the order of the nodes is the | 
 |   order in which they appear in the Python grammar.  Not all of the | 
 |   children are \class{Node} instances.  The names of functions and | 
 |   classes, for example, are plain strings. | 
 | \end{methoddesc} | 
 |  | 
 | \begin{methoddesc}{getChildNodes}{} | 
 |   Returns a flattened list of the child nodes in the order they | 
 |   occur.  This method is like \method{getChildren()}, except that it | 
 |   only returns those children that are \class{Node} instances. | 
 | \end{methoddesc} | 
 |  | 
 | Two examples illustrate the general structure of \class{Node} | 
 | classes.  The \keyword{while} statement is defined by the following | 
 | grammar production:  | 
 |  | 
 | \begin{verbatim} | 
 | while_stmt:     "while" expression ":" suite | 
 |                ["else" ":" suite] | 
 | \end{verbatim} | 
 |  | 
 | The \class{While} node has three attributes: \member{test}, | 
 | \member{body}, and \member{else_}.  (If the natural name for an | 
 | attribute is also a Python reserved word, it can't be used as an | 
 | attribute name.  An underscore is appended to the word to make it a | 
 | legal identifier, hence \member{else_} instead of \keyword{else}.) | 
 |  | 
 | The \keyword{if} statement is more complicated because it can include | 
 | several tests.   | 
 |  | 
 | \begin{verbatim} | 
 | if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite] | 
 | \end{verbatim} | 
 |  | 
 | The \class{If} node only defines two attributes: \member{tests} and | 
 | \member{else_}.  The \member{tests} attribute is a sequence of | 
 | (test expression, consequent body) pairs.  There is one pair for each | 
 | \keyword{if}/\keyword{elif} clause.  The first element of the pair is | 
 | the test expression.  The second element is a \class{Stmt} node that | 
 | contains the code to execute if the test is true. | 
 |  | 
 | The \method{getChildren()} method of \class{If} returns a flat list of | 
 | child nodes.  If there are three \keyword{if}/\keyword{elif} clauses | 
 | and no \keyword{else} clause, then \method{getChildren()} will return | 
 | a list of six elements: the first test expression, the first | 
 | \class{Stmt}, the second test expression, etc. | 
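
The flattening described above can be sketched with a toy class.  The
\class{If} class below is a simplified stand-in written for illustration, not
the real \class{compiler.ast.If}; it models only the \member{tests} and
\member{else_} attributes, plain strings stand in for the expression and
\class{Stmt} nodes, and the real implementation may differ in details such as
the return type.

```python
# Toy stand-in for compiler.ast.If, NOT the real class: it models only
# the tests / else_ attributes and the flattening described in the text.
class If(object):
    def __init__(self, tests, else_=None):
        self.tests = tests    # list of (test expression, body) pairs
        self.else_ = else_    # body of the else clause, or None

    def getChildren(self):
        # Flatten the pairs: test1, body1, test2, body2, ...
        children = []
        for test, body in self.tests:
            children.append(test)
            children.append(body)
        if self.else_ is not None:
            children.append(self.else_)
        return children

# Three if/elif clauses and no else clause; strings stand in for nodes.
node = If([("test1", "body1"), ("test2", "body2"), ("test3", "body3")])
flat = node.getChildren()
```

With three clauses and no \keyword{else} clause, \code{flat} holds six
elements, alternating test and body.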
 |  | 
 | The following table lists each of the \class{Node} subclasses defined | 
 | in \module{compiler.ast} and each of the public attributes available | 
 | on their instances.  The values of most of the attributes are | 
 | themselves \class{Node} instances or sequences of instances.  When the | 
 | value is something other than an instance, the type is noted in the | 
 | comment.  The attributes are listed in the order in which they are | 
 | returned by \method{getChildren()} and \method{getChildNodes()}. | 
 |  | 
 | \input{asttable} | 
 |  | 
 |  | 
 | \subsection{Assignment nodes} | 
 |  | 
 | There is a collection of nodes used to represent assignments.  Each | 
 | assignment statement in the source code becomes a single | 
 | \class{Assign} node in the AST.  The \member{nodes} attribute is a | 
 | list that contains a node for each assignment target.  This is | 
 | necessary because assignment can be chained, e.g. \code{a = b = 2}. | 
 | Each \class{Node} in the list will be one of the following classes:  | 
 | \class{AssAttr}, \class{AssList}, \class{AssName}, or | 
 | \class{AssTuple}.  | 
 |  | 
 | Each target assignment node describes the kind of object being | 
 | assigned to: \class{AssName} for a simple name, e.g. \code{a = 1}; | 
 | \class{AssAttr} for an attribute, e.g. \code{a.x = 1}; and | 
 | \class{AssList} and \class{AssTuple} for list and tuple unpacking | 
 | respectively, e.g. \code{a, b, c = a_tuple}. | 
 |  | 
 | The target assignment nodes also have a \member{flags} attribute that | 
 | indicates whether the node is being used for assignment or in a delete | 
 | statement.  The \class{AssName} is also used to represent a delete | 
 | statement, e.g. \code{del x}. | 
 |  | 
 | When an expression contains several attribute references, an | 
 | assignment or delete statement will contain only one \class{AssAttr} | 
 | node -- for the final attribute reference.  The other attribute | 
 | references will be represented as \class{Getattr} nodes in the | 
 | \member{expr} attribute of the \class{AssAttr} instance. | 
 |  | 
 | \subsection{Examples} | 
 |  | 
 | This section shows several simple examples of ASTs for Python source | 
 | code.  The examples demonstrate how to use the \function{parse()} | 
 | function, what the repr of an AST looks like, and how to access | 
 | attributes of an AST node. | 
 |  | 
 | The first module defines a single function.  Assume it is stored in | 
 | \file{/tmp/doublelib.py}.  | 
 |  | 
 | \begin{verbatim} | 
 | """This is an example module. | 
 |  | 
 | This is the docstring. | 
 | """ | 
 |  | 
 | def double(x): | 
 |     "Return twice the argument" | 
 |     return x * 2 | 
 | \end{verbatim} | 
 |  | 
 | In the interactive interpreter session below, I have reformatted the | 
 | long AST reprs for readability.  The AST reprs use unqualified class | 
 | names.  If you want to create an instance from a repr, you must import | 
 | the class names from the \module{compiler.ast} module. | 
 |  | 
 | \begin{verbatim} | 
 | >>> import compiler | 
 | >>> mod = compiler.parseFile("/tmp/doublelib.py") | 
 | >>> mod | 
 | Module('This is an example module.\n\nThis is the docstring.\n',  | 
 |        Stmt([Function('double', ['x'], [], 0, 'Return twice the argument',  | 
 |        Stmt([Return(Mul((Name('x'), Const(2))))]))])) | 
 | >>> from compiler.ast import * | 
 | >>> Module('This is an example module.\n\nThis is the docstring.\n',  | 
 | ...    Stmt([Function('double', ['x'], [], 0, 'Return twice the argument',  | 
 | ...    Stmt([Return(Mul((Name('x'), Const(2))))]))])) | 
 | Module('This is an example module.\n\nThis is the docstring.\n',  | 
 |        Stmt([Function('double', ['x'], [], 0, 'Return twice the argument',  | 
 |        Stmt([Return(Mul((Name('x'), Const(2))))]))])) | 
 | >>> mod.doc | 
 | 'This is an example module.\n\nThis is the docstring.\n' | 
 | >>> for node in mod.node.nodes: | 
 | ...     print node | 
 | ...  | 
 | Function('double', ['x'], [], 0, 'Return twice the argument', | 
 |          Stmt([Return(Mul((Name('x'), Const(2))))])) | 
 | >>> func = mod.node.nodes[0] | 
 | >>> func.code | 
 | Stmt([Return(Mul((Name('x'), Const(2))))]) | 
 | \end{verbatim} | 
 |  | 
 | \section{Using Visitors to Walk ASTs} | 
 |  | 
 | \declaremodule{}{compiler.visitor} | 
 |  | 
 | The visitor pattern is ...  The \refmodule{compiler} package uses a | 
 | variant on the visitor pattern that takes advantage of Python's | 
 | introspection features to eliminate the need for much of the visitor's | 
 | infrastructure. | 
 |  | 
 | The classes being visited do not need to be programmed to accept | 
 | visitors.  The visitor need only define visit methods for classes it | 
 | is specifically interested in; a default visit method can handle the | 
 | rest.  | 
 |  | 
 | XXX The magic \method{visit()} method for visitors. | 
 |  | 
 | \begin{funcdesc}{walk}{tree, visitor\optional{, verbose}} | 
 | \end{funcdesc} | 
 |  | 
 | \begin{classdesc}{ASTVisitor}{} | 
 |  | 
 | The \class{ASTVisitor} is responsible for walking over the tree in the | 
 | correct order.  A walk begins with a call to \method{preorder()}.  For | 
 | each node, it checks the \var{visitor} argument to \method{preorder()} | 
 | for a method named `visitNodeType,' where NodeType is the name of the | 
 | node's class, e.g. for a \class{While} node a \method{visitWhile()} | 
 | would be called.  If the method exists, it is called with the node as | 
 | its first argument. | 
 |  | 
 | The visitor method for a particular node type can control how child | 
 | nodes are visited during the walk.  The \class{ASTVisitor} modifies | 
 | the visitor argument by adding a visit method to the visitor; this | 
 | method can be used to visit a particular child node.  If no visitor is | 
 | found for a particular node type, the \method{default()} method is | 
 | called.  | 
 | \end{classdesc} | 
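
The name-based dispatch described above can be sketched in a few lines.  This
is a simplified illustration, not the \module{compiler.visitor}
implementation: the \class{Node}, \class{While} and \class{Name} classes, the
\function{walk()} function and the \class{WhileCounter} visitor are all
hypothetical stand-ins, and the fallback here simply descends into the
children.

```python
# Toy sketch of name-based visitor dispatch; Node, While, Name and
# walk() are simplified stand-ins, not the compiler package's own.
class Node(object):
    def __init__(self, *children):
        self.children = list(children)

class While(Node):
    pass

class Name(Node):
    pass

def walk(node, visitor):
    # Look up visit<ClassName> on the visitor by introspection.
    method = getattr(visitor, "visit" + node.__class__.__name__, None)
    if method is not None:
        method(node)
    else:
        # Default behaviour: descend into the children.
        for child in node.children:
            walk(child, visitor)

class WhileCounter(object):
    """Defines a visit method only for While nodes."""
    def __init__(self):
        self.count = 0

    def visitWhile(self, node):
        self.count += 1
        # The visit method decides whether the children are visited.
        for child in node.children:
            walk(child, self)

# A tree containing two nested While nodes and some Name leaves.
tree = Node(While(Name(), While(Name())), Name())
counter = WhileCounter()
walk(tree, counter)
```

The node classes carry no visitor-specific code, and the visitor defines a
method only for the one node type it cares about.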
 |  | 
 | \class{ASTVisitor} objects have the following methods: | 
 |  | 
 | XXX describe extra arguments | 
 |  | 
 | \begin{methoddesc}{default}{node\optional{, \moreargs}} | 
 | \end{methoddesc} | 
 |  | 
 | \begin{methoddesc}{dispatch}{node\optional{, \moreargs}} | 
 | \end{methoddesc} | 
 |  | 
 | \begin{methoddesc}{preorder}{tree, visitor} | 
 | \end{methoddesc} | 
 |  | 
 |  | 
 | \section{Bytecode Generation} | 
 |  | 
 | The code generator is a visitor that emits bytecodes.  Each visit method | 
 | can call the \method{emit()} method to emit a new bytecode.  The basic | 
 | code generator is specialized for modules, classes, and functions.  An | 
 | assembler converts the emitted instructions to the low-level bytecode | 
 | format.  It handles things like generation of constant lists of code | 
 | objects and calculation of jump offsets. |
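
The division of labour between \method{emit()} and the assembler can be
sketched as follows.  This is a toy model under stated assumptions, not the
\module{pyassem} format: the \class{Label} and \class{Emitter} classes, the
\function{assemble()} function, the instruction names (including the made-up
\code{CALL_BODY}) and the rule that every instruction occupies one slot are
all hypothetical.

```python
# Toy sketch only: the instruction names, the one-slot instruction size
# and the Label class are assumptions, not the pyassem format.
class Label(object):
    pass

class Emitter(object):
    def __init__(self):
        self.insts = []

    def emit(self, opname, arg=None):
        self.insts.append((opname, arg))

    def mark(self, label):
        # Record a jump target at the current position.
        self.insts.append(("LABEL", label))

def assemble(insts):
    # Pass 1: assign an offset to every label.
    offsets = {}
    position = 0
    for opname, arg in insts:
        if opname == "LABEL":
            offsets[arg] = position
        else:
            position += 1
    # Pass 2: drop the label markers and resolve label arguments.
    resolved = []
    for opname, arg in insts:
        if opname == "LABEL":
            continue
        if isinstance(arg, Label):
            arg = offsets[arg]
        resolved.append((opname, arg))
    return resolved

# Emit a loop-shaped instruction sequence with symbolic jump targets.
gen = Emitter()
loop_start, loop_end = Label(), Label()
gen.mark(loop_start)
gen.emit("LOAD_NAME", "x")
gen.emit("JUMP_IF_FALSE", loop_end)
gen.emit("CALL_BODY")            # hypothetical instruction
gen.emit("JUMP", loop_start)
gen.mark(loop_end)
gen.emit("RETURN")
program = assemble(gen.insts)
```

Emitting symbolic labels first and resolving them in a separate pass lets the
visit methods emit forward jumps before the target position is known.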