blob: 280ee0fa45ae3bc52a16f5c82f5b906366ea9076 [file] [log] [blame]
Fred Drakee2f99172001-09-27 20:06:07 +00001\chapter{Python compiler package \label{compiler}}
2
3\sectionauthor{Jeremy Hylton}{jeremy@zope.com}
4
5
6The Python compiler package is a tool for analyzing Python source code
7and generating Python bytecode. The compiler contains libraries to
8generate an abstract syntax tree from Python source code and to
9generate Python bytecode from the tree.
10
11The \refmodule{compiler} package is a Python source to bytecode
12translator written in Python. It uses the built-in parser and
13standard \refmodule{parser} module to generated a concrete syntax
14tree. This tree is used to generate an abstract syntax tree (AST) and
15then Python bytecode.
16
17The full functionality of the package duplicates the builtin compiler
18provided with the Python interpreter. It is intended to match its
19behavior almost exactly. Why implement another compiler that does the
20same thing? The package is useful for a variety of purposes. It can
21be modified more easily than the builtin compiler. The AST it
22generates is useful for analyzing Python source code.
23
24This chapter explains how the various components of the
25\refmodule{compiler} package work. It blends reference material with
26a tutorial.
27
28The following modules are part of the \refmodule{compiler} package:
29
30\localmoduletable
31
32
33\subsection{The basic interface}
34
35\declaremodule{}{compiler}
36
37The top-level of the package defines four functions. If you import
38\module{compiler}, you will get these functions and a collection of
39modules contained in the package.
40
41\begin{funcdesc}{parse}{buf}
42Returns an abstract syntax tree for the Python source code in \var{buf}.
43The function raises SyntaxError if there is an error in the source
44code. The return value is a \class{compiler.ast.Module} instance that
45contains the tree.
46\end{funcdesc}
47
48\begin{funcdesc}{parseFile}{path}
49Return an abstract syntax tree for the Python source code in the file
50specified by \var{path}. It is equivalent to
51\code{parse(open(\var{path}).read())}.
52\end{funcdesc}
53
54\begin{funcdesc}{walk}{ast, visitor\optional{, verbose}}
55Do a pre-order walk over the abstract syntax tree \var{ast}. Call the
56appropriate method on the \var{visitor} instance for each node
57encountered.
58\end{funcdesc}
59
60\begin{funcdesc}{compile}{path}
61Compile the file \var{path} and generate the corresponding \file{.pyc}
62file.
63\end{funcdesc}
64
65The \module{compiler} package contains the following modules:
66\refmodule[compiler.ast]{ast}, \module{consts}, \module{future},
67\module{misc}, \module{pyassem}, \module{pycodegen}, \module{symbols},
68\module{transformer}, and \refmodule[compiler.visitor]{visitor}.
69
70\subsection{Limitations}
71
72There are some problems with the error checking of the compiler
73package. The interpreter detects syntax errors in two distinct
74phases. One set of errors is detected by the interpreter's parser,
75the other set by the compiler. The compiler package relies on the
76interpreter's parser, so it get the first phases of error checking for
77free. It implements the second phase itself, and that implement is
78incomplete. For example, the compiler package does not raise an error
79if a name appears more than once in an argument list:
80\code{def f(x, x): ...}
81
82A future version of the compiler should fix these problems.
83
84\section{Python Abstract Syntax}
85
86The \module{compiler.ast} module defines an abstract syntax for
87Python. In the abstract syntax tree, each node represents a syntactic
88construct. The root of the tree is \class{Module} object.
89
90The abstract syntax offers a higher level interface to parsed Python
91source code. The \ulink{\module{parser}}
92{http://www.python.org/doc/current/lib/module-parser.html}
93module and the compiler written in C for the Python interpreter use a
94concrete syntax tree. The concrete syntax is tied closely to the
95grammar description used for the Python parser. Instead of a single
96node for a construct, there are often several levels of nested nodes
97that are introduced by Python's precedence rules.
98
99The abstract syntax tree is created by the
100\module{compiler.transformer} module. The transformer relies on the
101builtin Python parser to generate a concrete syntax tree. It
102generates an abstract syntax tree from the concrete tree.
103
104The \module{transformer} module was created by Greg
105Stein\index{Stein, Greg} and Bill Tutt\index{Tutt, Bill} for an
106experimental Python-to-C compiler. The current version contains a
107number of modifications and improvements, but the basic form of the
108abstract syntax and of the transformer are due to Stein and Tutt.
109
110\subsection{AST Nodes}
111
112\declaremodule{}{compiler.ast}
113
114The \module{compiler.ast} module is generated from a text file that
115describes each node type and its elements. Each node type is
116represented as a class that inherits from the abstract base class
117\class{compiler.ast.Node} and defines a set of named attributes for
118child nodes.
119
120\begin{classdesc}{Node}{}
121
122 The \class{Node} instances are created automatically by the parser
123 generator. The recommended interface for specific \class{Node}
124 instances is to use the public attributes to access child nodes. A
125 public attribute may be bound to a single node or to a sequence of
126 nodes, depending on the \class{Node} type. For example, the
127 \member{bases} attribute of the \class{Class} node, is bound to a
128 list of base class nodes, and the \member{doc} attribute is bound to
129 a single node.
130
131 Each \class{Node} instance has a \member{lineno} attribute which may
132 be \code{None}. XXX Not sure what the rules are for which nodes
133 will have a useful lineno.
134\end{classdesc}
135
136All \class{Node} objects offer the following methods:
137
138\begin{methoddesc}{getChildren}{}
139 Returns a flattened list of the child nodes and objects in the
140 order they occur. Specifically, the order of the nodes is the
141 order in which they appear in the Python grammar. Not all of the
142 children are \class{Node} instances. The names of functions and
143 classes, for example, are plain strings.
144\end{methoddesc}
145
146\begin{methoddesc}{getChildNodes}{}
147 Returns a flattened list of the child nodes in the order they
148 occur. This method is like \method{getChildren()}, except that it
149 only returns those children that are \class{Node} instances.
150\end{methoddesc}
151
152Two examples illustrate the general structure of \class{Node}
153classes. The \keyword{while} statement is defined by the following
154grammar production:
155
156\begin{verbatim}
157while_stmt: "while" expression ":" suite
158 ["else" ":" suite]
159\end{verbatim}
160
161The \class{While} node has three attributes: \member{test},
162\member{body}, and \member{else_}. (If the natural name for an
163attribute is also a Python reserved word, it can't be used as an
164attribute name. An underscore is appended to the word to make it a
165legal identifier, hence \member{else_} instead of \keyword{else}.)
166
167The \keyword{if} statement is more complicated because it can include
168several tests.
169
170\begin{verbatim}
171if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
172\end{verbatim}
173
174The \class{If} node only defines two attributes: \member{tests} and
175\member{else_}. The \member{tests} attribute is a sequence of test
176expression, consequent body pairs. There is one pair for each
177\keyword{if}/\keyword{elif} clause. The first element of the pair is
178the test expression. The second elements is a \class{Stmt} node that
179contains the code to execute if the test is true.
180
181The \method{getChildren()} method of \class{If} returns a flat list of
182child nodes. If there are three \keyword{if}/\keyword{elif} clauses
183and no \keyword{else} clause, then \method{getChildren()} will return
184a list of six elements: the first test expression, the first
185\class{Stmt}, the second text expression, etc.
186
187The following table lists each of the \class{Node} subclasses defined
188in \module{compiler.ast} and each of the public attributes available
189on their instances. The values of most of the attributes are
190themselves \class{Node} instances or sequences of instances. When the
191value is something other than an instance, the type is noted in the
192comment. The attributes are listed in the order in which they are
193returned by \method{getChildren()} and \method{getChildNodes()}.
194
195\input{asttable}
196
197
198\subsection{Assignment nodes}
199
200There is a collection of nodes used to represent assignments. Each
201assignment statement in the source code becomes a single
202\class{Assign} node in the AST. The \member{nodes} attribute is a
203list that contains a node for each assignment target. This is
204necessary because assignment can be chained, e.g. \code{a = b = 2}.
205Each \class{Node} in the list will be one of the following classes:
206\class{AssAttr}, \class{AssList}, \class{AssName}, or
207\class{AssTuple}.
208
209Each target assignment node will describe the kind of object being
210assigned to: \class{AssName} for a simple name, e.g. \code{a = 1}.
211\class{AssAttr} for an attribute assigned, e.g. \code{a.x = 1}.
212\class{AssList} and \class{AssTuple} for list and tuple expansion
213respectively, e.g. \code{a, b, c = a_tuple}.
214
215The target assignment nodes also have a \member{flags} attribute that
216indicates whether the node is being used for assignment or in a delete
217statement. The \class{AssName} is also used to represent a delete
218statement, e.g. \class{del x}.
219
220When an expression contains several attribute references, an
221assignment or delete statement will contain only one \class{AssAttr}
222node -- for the final attribute reference. The other attribute
223references will be represented as \class{Getattr} nodes in the
224\member{expr} attribute of the \class{AssAttr} instance.
225
226\subsection{Examples}
227
228This section shows several simple examples of ASTs for Python source
229code. The examples demonstrate how to use the \function{parse()}
230function, what the repr of an AST looks like, and how to access
231attributes of an AST node.
232
233The first module defines a single function. Assume it is stored in
234\file{/tmp/doublelib.py}.
235
236\begin{verbatim}
237"""This is an example module.
238
239This is the docstring.
240"""
241
242def double(x):
243 "Return twice the argument"
244 return x * 2
245\end{verbatim}
246
247In the interactive interpreter session below, I have reformatted the
248long AST reprs for readability. The AST reprs use unqualified class
249names. If you want to create an instance from a repr, you must import
250the class names from the \module{compiler.ast} module.
251
252\begin{verbatim}
253>>> import compiler
254>>> mod = compiler.parseFile("/tmp/doublelib.py")
255>>> mod
256Module('This is an example module.\n\nThis is the docstring.\n',
257 Stmt([Function('double', ['x'], [], 0, 'Return twice the argument',
258 Stmt([Return(Mul((Name('x'), Const(2))))]))]))
259>>> from compiler.ast import *
260>>> Module('This is an example module.\n\nThis is the docstring.\n',
261... Stmt([Function('double', ['x'], [], 0, 'Return twice the argument',
262... Stmt([Return(Mul((Name('x'), Const(2))))]))]))
263Module('This is an example module.\n\nThis is the docstring.\n',
264 Stmt([Function('double', ['x'], [], 0, 'Return twice the argument',
265 Stmt([Return(Mul((Name('x'), Const(2))))]))]))
266>>> mod.doc
267'This is an example module.\n\nThis is the docstring.\n'
268>>> for node in mod.node.nodes:
269... print node
270...
271Function('double', ['x'], [], 0, 'Return twice the argument',
272 Stmt([Return(Mul((Name('x'), Const(2))))]))
273>>> func = mod.node.nodes[0]
274>>> func.code
275Stmt([Return(Mul((Name('x'), Const(2))))])
276\end{verbatim}
277
278\section{Using Visitors to Walk ASTs}
279
280\declaremodule{}{compiler.visitor}
281
282The visitor pattern is ... The \refmodule{compiler} package uses a
283variant on the visitor pattern that takes advantage of Python's
284introspection features to elminiate the need for much of the visitor's
285infrastructure.
286
287The classes being visited do not need to be programmed to accept
288visitors. The visitor need only define visit methods for classes it
289is specifically interested in; a default visit method can handle the
290rest.
291
292XXX The magic \method{visit()} method for visitors.
293
294\begin{funcdesc}{walk}{tree, visitor\optional{, verbose}}
295\end{funcdesc}
296
297\begin{classdesc}{ASTVisitor}{}
298
299The \class{ASTVisitor} is responsible for walking over the tree in the
300correct order. A walk begins with a call to \method{preorder()}. For
301each node, it checks the \var{visitor} argument to \method{preorder()}
302for a method named `visitNodeType,' where NodeType is the name of the
303node's class, e.g. for a \class{While} node a \method{visitWhile()}
304would be called. If the method exists, it is called with the node as
305its first argument.
306
307The visitor method for a particular node type can control how child
308nodes are visited during the walk. The \class{ASTVisitor} modifies
309the visitor argument by adding a visit method to the visitor; this
310method can be used to visit a particular child node. If no visitor is
311found for a particular node type, the \method{default()} method is
312called.
313\end{classdesc}
314
315\class{ASTVisitor} objects have the following methods:
316
317XXX describe extra arguments
318
319\begin{methoddesc}{default}{node\optional{, \moreargs}}
320\end{methoddesc}
321
322\begin{methoddesc}{dispatch}{node\optional{, \moreargs}}
323\end{methoddesc}
324
325\begin{methoddesc}{preorder}{tree, visitor}
326\end{methoddesc}
327
328
329\section{Bytecode Generation}
330
331The code generator is a visitor that emits bytecodes. Each visit method
332can call the \method{emit()} method to emit a new bytecode. The basic
333code generator is specialized for modules, classes, and functions. An
334assembler converts that emitted instructions to the low-level bytecode
335format. It handles things like generator of constant lists of code
336objects and calculation of jump offsets.