blob: 12f2128b551f0a3753856d7c33baca52d9596238 [file] [log] [blame]
Brett Cannon82a93942006-02-09 02:43:14 +00001Developer Notes for Python Compiler
2===================================
3
4Table of Contents
5-----------------
6
7- Scope
8 Defines the limits of the change
9- Parse Trees
10 Describes the local (Python) concept
11- Abstract Syntax Trees (AST)
12 Describes the AST technology used
13- Parse Tree to AST
14 Defines the transform approach
15- Control Flow Graphs
16 Defines the creation of "basic blocks"
17- AST to CFG to Bytecode
18 Tracks the flow from AST to bytecode
19- Code Objects
20 Pointer to making bytecode "executable"
21- Modified Files
22 Files added/modified/removed from CPython compiler
23- ToDo
24 Work yet remaining (before complete)
25- References
26 Academic and technical references to technology used.
27
28
29Scope
30-----
31
32Historically (through 2.4), compilation from source code to bytecode
33involved two steps:
34
351. Parse the source code into a parse tree (Parser/pgen.c)
362. Emit bytecode based on the parse tree (Python/compile.c)
37
38Historically, this is not how a standard compiler works. The usual
39steps for compilation are:
40
411. Parse source code into a parse tree (Parser/pgen.c)
422. Transform parse tree into an Abstract Syntax Tree (Python/ast.c)
433. Transform AST into a Control Flow Graph (Python/newcompile.c)
444. Emit bytecode based on the Control Flow Graph (Python/newcompile.c)
45
46Starting with Python 2.5, the above steps are now used. This change
47was done to simplify compilation by breaking it into three steps.
48The purpose of this document is to outline how the lattter three steps
49of the process works.
50
51This document does not touch on how parsing works beyond what is needed
52to explain what is needed for compilation. It is also not exhaustive
53in terms of the how the entire system works. You will most likely need
54to read some source to have an exact understanding of all details.
55
56
57Parse Trees
58-----------
59
60Python's parser is an LL(1) parser mostly based off of the
61implementation laid out in the Dragon Book [Aho86]_.
62
63The grammar file for Python can be found in Grammar/Grammar with the
64numeric value of grammar rules are stored in Include/graminit.h. The
65numeric values for types of tokens (literal tokens, such as ``:``,
66numbers, etc.) are kept in Include/token.h). The parse tree made up of
67``node *`` structs (as defined in Include/node.h).
68
69Querying data from the node structs can be done with the following
70macros (which are all defined in Include/token.h):
71
72- ``CHILD(node *, int)``
73 Returns the nth child of the node using zero-offset indexing
74- ``RCHILD(node *, int)``
75 Returns the nth child of the node from the right side; use
76 negative numbers!
77- ``NCH(node *)``
78 Number of children the node has
79- ``STR(node *)``
80 String representation of the node; e.g., will return ``:`` for a
81 COLON token
82- ``TYPE(node *)``
83 The type of node as specified in ``Include/graminit.h``
84- ``REQ(node *, TYPE)``
85 Assert that the node is the type that is expected
86- ``LINENO(node *)``
87 retrieve the line number of the source code that led to the
88 creation of the parse rule; defined in Python/ast.c
89
90To tie all of this example, consider the rule for 'while'::
91
92 while_stmt: 'while' test ':' suite ['else' ':' suite]
93
94The node representing this will have ``TYPE(node) == while_stmt`` and
95the number of children can be 4 or 7 depending on if there is an 'else'
96statement. To access what should be the first ':' and require it be an
97actual ':' token, `(REQ(CHILD(node, 2), COLON)``.
98
99
100Abstract Syntax Trees (AST)
101---------------------------
102
103The abstract syntax tree (AST) is a high-level representation of the
104program structure without the necessity of containing the source code;
105it can be thought of a abstract representation of the source code. The
106specification of the AST nodes is specified using the Zephyr Abstract
107Syntax Definition Language (ASDL) [Wang97]_.
108
109The definition of the AST nodes for Python is found in the file
110Parser/Python.asdl .
111
112Each AST node (representing statements, expressions, and several
113specialized types, like list comprehensions and exception handlers) is
114defined by the ASDL. Most definitions in the AST correspond to a
115particular source construct, such as an 'if' statement or an attribute
116lookup. The definition is independent of its realization in any
117particular programming language.
118
119The following fragment of the Python ASDL construct demonstrates the
120approach and syntax::
121
122 module Python
123 {
124 stmt = FunctionDef(identifier name, arguments args, stmt* body,
125 expr* decorators)
126 | Return(expr? value) | Yield(expr value)
127 attributes (int lineno)
128 }
129
130The preceding example describes three different kinds of statements;
131function definitions, return statements, and yield statements. All
132three kinds are considered of type stmt as shown by '|' separating the
133various kinds. They all take arguments of various kinds and amounts.
134
135Modifiers on the argument type specify the number of values needed; '?'
136means it is optional, '*' means 0 or more, no modifier means only one
137value for the argument and it is required. FunctionDef, for instance,
138takes an identifier for the name, 'arguments' for args, zero or more
139stmt arguments for 'body', and zero or more expr arguments for
140'decorators'.
141
142Do notice that something like 'arguments', which is a node type, is
143represented as a single AST node and not as a sequence of nodes as with
144stmt as one might expect.
145
146All three kinds also have an 'attributes' argument; this is shown by the
147fact that 'attributes' lacks a '|' before it.
148
149The statement definitions above generate the following C structure type::
150
151 typedef struct _stmt *stmt_ty;
152
153 struct _stmt {
154 enum { FunctionDef_kind=1, Return_kind=2, Yield_kind=3 } kind;
155 union {
156 struct {
157 identifier name;
158 arguments_ty args;
159 asdl_seq *body;
160 } FunctionDef;
161
162 struct {
163 expr_ty value;
164 } Return;
165
166 struct {
167 expr_ty value;
168 } Yield;
169 } v;
170 int lineno;
171 }
172
173Also generated are a series of constructor functions that allocate (in
174this case) a stmt_ty struct with the appropriate initialization. The
175'kind' field specifies which component of the union is initialized. The
176FunctionDef() constructor function sets 'kind' to FunctionDef_kind and
177initializes the 'name', 'args', 'body', and 'attributes' fields.
178
179*** NOTE: if you make a change here that can affect the output of bytecode that
180is already in existence, make sure to delete your old .py(c|o) files! Running
181``find . -name '*.py[co]' -exec rm -f {} ';'`` should do the trick.
182
183
184Parse Tree to AST
185-----------------
186
187The AST is generated from the parse tree in (see Python/ast.c) using the
188function::
189
190 mod_ty PyAST_FromNode(const node *n);
191
192The function begins a tree walk of the parse tree, creating various AST
193nodes as it goes along. It does this by allocating all new nodes it
194needs, calling the proper AST node creation functions for any required
195supporting functions, and connecting them as needed.
196
197Do realize that there is no automated nor symbolic connection between
198the grammar specification and the nodes in the parse tree. No help is
199directly provided by the parse tree as in yacc.
200
201For instance, one must keep track of
202which node in the parse tree one is working with (e.g., if you are
203working with an 'if' statement you need to watch out for the ':' token
204to find the end of the conditional). No help is directly provided by
205the parse tree as in yacc.
206
207The functions called to generate AST nodes from the parse tree all have
208the name ast_for_xx where xx is what the grammar rule that the function
209handles (alias_for_import_name is the exception to this). These in turn
210call the constructor functions as defined by the ASDL grammar and
211contained in Python/Python-ast.c (which was generated by
212Parser/asdl_c.py) to create the nodes of the AST. This all leads to a
213sequence of AST nodes stored in asdl_seq structs.
214
215
216Function and macros for creating and using ``asdl_seq *`` types as found
217in Python/asdl.c and Include/asdl.h:
218
219- ``asdl_seq_new(int)``
220 Allocate memory for an asdl_seq for length 'size'
221- ``asdl_seq_free(asdl_seq *)``
222 Free asdl_seq struct
223- ``asdl_seq_GET(asdl_seq *seq, int pos)``
224 Get item held at 'pos'
225- ``asdl_seq_SET(asdl_seq *seq, int pos, void *val)``
226 Set 'pos' in 'seq' to 'val'
227- ``asdl_seq_APPEND(asdl_seq *seq, void *val)``
228 Set the end of 'seq' to 'val'
229- ``asdl_seq_LEN(asdl_seq *)``
230 Return the length of 'seq'
231
232If you are working with statements, you must also worry about keeping
233track of what line number generated the statement. Currently the line
234number is passed as the last parameter to each stmt_ty function.
235
236
237Control Flow Graphs
238-------------------
239
240A control flow graph (often referenced by its acronym, CFG) is a
241directed graph that models the flow of a program using basic blocks that
242contain the intermediate representation (abbreviated "IR", and in this
243case is Python bytecode) within the blocks. Basic blocks themselves are
244a block of IR that has a single entry point but possibly multiple exit
245points. The single entry point is the key to basic blocks; it all has
246to do with jumps. An entry point is the target of something that
247changes control flow (such as a function call or a jump) while exit
248points are instructions that would change the flow of the program (such
249as jumps and 'return' statements). What this means is that a basic
250block is a chunk of code that starts at the entry point and runs to an
251exit point or the end of the block.
252
253As an example, consider an 'if' statement with an 'else' block. The
254guard on the 'if' is a basic block which is pointed to by the basic
255block containing the code leading to the 'if' statement. The 'if'
256statement block contains jumps (which are exit points) to the true body
257of the 'if' and the 'else' body (which may be NULL), each of which are
258their own basic blocks. Both of those blocks in turn point to the
259basic block representing the code following the entire 'if' statement.
260
261CFGs are usually one step away from final code output. Code is directly
262generated from the basic blocks (with jump targets adjusted based on the
263output order) by doing a post-order depth-first search on the CFG
264following the edges.
265
266
267AST to CFG to Bytecode
268----------------------
269
270With the AST created, the next step is to create the CFG. The first step
271is to convert the AST to Python bytecode without having jump targets
272resolved to specific offsets (this is calculated when the CFG goes to
273final bytecode). Essentially, this transforms the AST into Python
274bytecode with control flow represented by the edges of the CFG.
275
276Conversion is done in two passes. The first creates the namespace
277(variables can be classified as local, free/cell for closures, or
278global). With that done, the second pass essentially flattens the CFG
279into a list and calculates jump offsets for final output of bytecode.
280
281The conversion process is initiated by a call to the function in
282Python/newcompile.c::
283
284 PyCodeObject * PyAST_Compile(mod_ty, const char *, PyCompilerFlags);
285
286This function does both the conversion of the AST to a CFG and
287outputting final bytecode from the CFG. The AST to CFG step is handled
288mostly by the two functions called by PyAST_Compile()::
289
290 struct symtable * PySymtable_Build(mod_ty, const char *,
291 PyFutureFeatures);
292 PyCodeObject * compiler_mod(struct compiler *, mod_ty);
293
294The former is in Python/symtable.c while the latter is in
295Python/newcompile.c .
296
297PySymtable_Build() begins by entering the starting code block for the
298AST (passed-in) and then calling the proper symtable_visit_xx function
299(with xx being the AST node type). Next, the AST tree is walked with
300the various code blocks that delineate the reach of a local variable
301as blocks are entered and exited::
302
303 static int symtable_enter_block(struct symtable *, identifier,
304 block_ty, void *, int);
305 static int symtable_exit_block(struct symtable *, void *);
306
307Once the symbol table is created, it is time for CFG creation, whose
308code is in Python/newcompile.c . This is handled by several functions
309that break the task down by various AST node types. The functions are
310all named compiler_visit_xx where xx is the name of the node type (such
311as stmt, expr, etc.). Each function receives a ``struct compiler *``
312and xx_ty where xx is the AST node type. Typically these functions
313consist of a large 'switch' statement, branching based on the kind of
314node type passed to it. Simple things are handled inline in the
315'switch' statement with more complex transformations farmed out to other
316functions named compiler_xx with xx being a descriptive name of what is
317being handled.
318
319When transforming an arbitrary AST node, use the VISIT macro::
320
321 VISIT(struct compiler *, <node type>, <AST node>);
322
323The appropriate compiler_visit_xx function is called, based on the value
324passed in for <node type> (so ``VISIT(c, expr, node)`` calls
325``compiler_visit_expr(c, node)``). The VISIT_SEQ macro is very similar,
326 but is called on AST node sequences (those values that were created as
327arguments to a node that used the '*' modifier). There is also
328VISIT_SLICE just for handling slices::
329
330 VISIT_SLICE(struct compiler *, slice_ty, expr_context_ty);
331
332Emission of bytecode is handled by the following macros:
333
334- ``ADDOP(struct compiler *c, int op)``
335 add 'op' as an opcode
336- ``ADDOP_I(struct compiler *c, int op, int oparg)``
337 add 'op' with an 'oparg' argument
338- ``ADDOP_O(struct compiler *c, int op, PyObject *type, PyObject *obj)``
339 add 'op' with the proper argument based on the position of obj in
340 'type', but with no handling of mangled names; used for when you
341 need to do named lookups of objects such as globals, consts, or
342 parameters where name mangling is not possible and the scope of the
343 name is known
344- ``ADDOP_NAME(struct compiler *, int, PyObject *, PyObject *)``
345 just like ADDOP_O, but name mangling is also handled; used for
346 attribute loading or importing based on name
347- ``ADDOP_JABS(struct compiling *c, int op, basicblock b)``
348 create an absolute jump to the basic block 'b'
349- ``ADDOP_JREL(struct compiling *c, int op, basicblock b)``
350 create a relative jump to the basic block 'b'
351
352Several helper functions that will emit bytecode and are named
353compiler_xx() where xx is what the function helps with (list, boolop
354 etc.). A rather useful one is::
355
356 static int compiler_nameop(struct compiler *, identifier,
357 expr_context_ty);
358
359This function looks up the scope of a variable and, based on the
360expression context, emits the proper opcode to load, store, or delete
361the variable.
362
363As for handling the line number on which a statement is defined, is
364handled by compiler_visit_stmt() and thus is not a worry.
365
366In addition to emitting bytecode based on the AST node, handling the
367creation of basic blocks must be done. Below are the macros and
368functions used for managing basic blocks:
369
370- ``NEW_BLOCK(struct compiler *)``
371 create block and set it as current
372- ``NEXT_BLOCK(struct compiler *)``
373 basically NEW_BLOCK() plus jump from current block
374- ``compiler_new_block(struct compiler *)``
375 create a block but don't use it (used for generating jumps)
376
377Once the CFG is created, it must be flattened and then final emission of
378bytecode occurs. Flattening is handled using a post-order depth-first
379search. Once flattened, jump offsets are backpatched based on the
380flattening and then a PyCodeObject file is created. All of this is
381handled by calling::
382
383 PyCodeObject * assemble(struct compiler *, int);
384
385*** NOTE: if you make a change here that can affect the output of bytecode that
386is already in existence, make sure to delete your old .py(c|o) files! Running
387``find . -name '*.py[co]' -exec rm -f {} ';'`` should do the trick.
388
389
390Code Objects
391------------
392
393In the end, one ends up with a PyCodeObject which is defined in
394Include/code.h . And with that you now have executable Python bytecode!
395
396
397Modified Files
398--------------
399
400+ Parser/
401
402 - Python.asdl
403 ASDL syntax file
404
405 - asdl.py
406 "An implementation of the Zephyr Abstract Syntax Definition
407 Language." Uses SPARK_ to parse the ASDL files.
408
409 - asdl_c.py
410 "Generate C code from an ASDL description." Generates
411 ../Python/Python-ast.c and ../Include/Python-ast.h .
412
413 - spark.py
414 SPARK_ parser generator
415
416+ Python/
417
418 - Python-ast.c
419 Creates C structs corresponding to the ASDL types. Also
420 contains code for marshaling AST nodes (core ASDL types have
421 marshaling code in asdl.c). "File automatically generated by
422 ../Parser/asdl_c.py".
423
424 - asdl.c
425 Contains code to handle the ASDL sequence type. Also has code
426 to handle marshalling the core ASDL types, such as number and
427 identifier. used by Python-ast.c for marshaling AST nodes.
428
429 - ast.c
430 Converts Python's parse tree into the abstract syntax tree.
431
432 - compile.txt
433 This file.
434
435 - newcompile.c
436 New version of compile.c that handles the emitting of bytecode.
437
438 - symtable.c
439 Generates symbol table from AST.
440
441
442+ Include/
443
444 - Python-ast.h
445 Contains the actual definitions of the C structs as generated by
446 ../Python/Python-ast.c .
447 "Automatically generated by ../Parser/asdl_c.py".
448
449 - asdl.h
450 Header for the corresponding ../Python/ast.c .
451
452 - ast.h
453 Declares PyAST_FromNode() external (from ../Python/ast.c).
454
455 - code.h
456 Header file for ../Objects/codeobject.c; contains definition of
457 PyCodeObject.
458
459 - symtable.h
460 Header for ../Python/symtable.c . struct symtable and
461 PySTEntryObject are defined here.
462
463+ Objects/
464
465 - codeobject.c
466 Contains PyCodeObject-related code (originally in
467 ../Python/compile.c).
468
469
470ToDo
471----
472*** NOTE: all bugs and patches should be filed on SF under the group
473 "AST" for easy searching. It also does not hurt to put
474 "[AST]" at the beginning of the subject line of the tracker
475 item.
476
477+ Stdlib support
478 - AST->Python access?
479 - rewrite compiler package to mirror AST structure?
480+ Documentation
481 - flesh out this doc
482 * byte stream output
483 * explanation of how the symbol table pass works
484 * code object (PyCodeObject)
485+ Universal
486 - make sure entire test suite passes
487 - fix memory leaks
488 - make sure return types are properly checked for errors
489 - no gcc warnings
490
491References
492----------
493
494.. [Aho86] Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman.
495 `Compilers: Principles, Techniques, and Tools`,
496 http://www.amazon.com/exec/obidos/tg/detail/-/0201100886/104-0162389-6419108
497
498.. [Wang97] Daniel C. Wang, Andrew W. Appel, Jeff L. Korn, and Chris
499 S. Serra. `The Zephyr Abstract Syntax Description Language.`_
500 In Proceedings of the Conference on Domain-Specific Languages, pp.
501 213--227, 1997.
502
503.. _The Zephyr Abstract Syntax Description Language.:
504 http://www.cs.princeton.edu/~danwang/Papers/dsl97/dsl97.html
505
506.. _SPARK: http://pages.cpsc.ucalgary.ca/~aycock/spark/
507