| Developer Notes for Python Compiler |
| =================================== |
| |
| Table of Contents |
| ----------------- |
| |
| - Scope |
| Defines the limits of the change |
| - Parse Trees |
| Describes the local (Python) concept |
| - Abstract Syntax Trees (AST) |
| Describes the AST technology used |
| - Parse Tree to AST |
| Defines the transform approach |
| - Control Flow Graphs |
| Defines the creation of "basic blocks" |
| - AST to CFG to Bytecode |
| Tracks the flow from AST to bytecode |
| - Code Objects |
| Pointer to making bytecode "executable" |
| - Modified Files |
| Files added/modified/removed from CPython compiler |
| - ToDo |
| Work yet remaining (before complete) |
| - References |
| Academic and technical references to technology used. |
| |
| |
| Scope |
| ----- |
| |
| Historically (through 2.4), compilation from source code to bytecode |
| involved two steps: |
| |
| 1. Parse the source code into a parse tree (Parser/pgen.c) |
| 2. Emit bytecode based on the parse tree (Python/compile.c) |
| |
| Historically, this is not how a standard compiler works. The usual |
| steps for compilation are: |
| |
| 1. Parse source code into a parse tree (Parser/pgen.c) |
| 2. Transform parse tree into an Abstract Syntax Tree (Python/ast.c) |
| 3. Transform AST into a Control Flow Graph (Python/newcompile.c) |
| 4. Emit bytecode based on the Control Flow Graph (Python/newcompile.c) |
| |
| Starting with Python 2.5, the above steps are now used. This change |
| was done to simplify compilation by breaking it into three steps. |
| The purpose of this document is to outline how the lattter three steps |
| of the process works. |
| |
| This document does not touch on how parsing works beyond what is needed |
| to explain what is needed for compilation. It is also not exhaustive |
| in terms of the how the entire system works. You will most likely need |
| to read some source to have an exact understanding of all details. |
| |
| |
| Parse Trees |
| ----------- |
| |
| Python's parser is an LL(1) parser mostly based off of the |
| implementation laid out in the Dragon Book [Aho86]_. |
| |
| The grammar file for Python can be found in Grammar/Grammar with the |
| numeric value of grammar rules are stored in Include/graminit.h. The |
| numeric values for types of tokens (literal tokens, such as ``:``, |
| numbers, etc.) are kept in Include/token.h). The parse tree made up of |
| ``node *`` structs (as defined in Include/node.h). |
| |
| Querying data from the node structs can be done with the following |
| macros (which are all defined in Include/token.h): |
| |
| - ``CHILD(node *, int)`` |
| Returns the nth child of the node using zero-offset indexing |
| - ``RCHILD(node *, int)`` |
| Returns the nth child of the node from the right side; use |
| negative numbers! |
| - ``NCH(node *)`` |
| Number of children the node has |
| - ``STR(node *)`` |
| String representation of the node; e.g., will return ``:`` for a |
| COLON token |
| - ``TYPE(node *)`` |
| The type of node as specified in ``Include/graminit.h`` |
| - ``REQ(node *, TYPE)`` |
| Assert that the node is the type that is expected |
| - ``LINENO(node *)`` |
| retrieve the line number of the source code that led to the |
| creation of the parse rule; defined in Python/ast.c |
| |
| To tie all of this example, consider the rule for 'while':: |
| |
| while_stmt: 'while' test ':' suite ['else' ':' suite] |
| |
| The node representing this will have ``TYPE(node) == while_stmt`` and |
| the number of children can be 4 or 7 depending on if there is an 'else' |
| statement. To access what should be the first ':' and require it be an |
| actual ':' token, `(REQ(CHILD(node, 2), COLON)``. |
| |
| |
| Abstract Syntax Trees (AST) |
| --------------------------- |
| |
| The abstract syntax tree (AST) is a high-level representation of the |
| program structure without the necessity of containing the source code; |
| it can be thought of a abstract representation of the source code. The |
| specification of the AST nodes is specified using the Zephyr Abstract |
| Syntax Definition Language (ASDL) [Wang97]_. |
| |
| The definition of the AST nodes for Python is found in the file |
| Parser/Python.asdl . |
| |
| Each AST node (representing statements, expressions, and several |
| specialized types, like list comprehensions and exception handlers) is |
| defined by the ASDL. Most definitions in the AST correspond to a |
| particular source construct, such as an 'if' statement or an attribute |
| lookup. The definition is independent of its realization in any |
| particular programming language. |
| |
| The following fragment of the Python ASDL construct demonstrates the |
| approach and syntax:: |
| |
| module Python |
| { |
| stmt = FunctionDef(identifier name, arguments args, stmt* body, |
| expr* decorators) |
| | Return(expr? value) | Yield(expr value) |
| attributes (int lineno) |
| } |
| |
| The preceding example describes three different kinds of statements; |
| function definitions, return statements, and yield statements. All |
| three kinds are considered of type stmt as shown by '|' separating the |
| various kinds. They all take arguments of various kinds and amounts. |
| |
| Modifiers on the argument type specify the number of values needed; '?' |
| means it is optional, '*' means 0 or more, no modifier means only one |
| value for the argument and it is required. FunctionDef, for instance, |
| takes an identifier for the name, 'arguments' for args, zero or more |
| stmt arguments for 'body', and zero or more expr arguments for |
| 'decorators'. |
| |
| Do notice that something like 'arguments', which is a node type, is |
| represented as a single AST node and not as a sequence of nodes as with |
| stmt as one might expect. |
| |
| All three kinds also have an 'attributes' argument; this is shown by the |
| fact that 'attributes' lacks a '|' before it. |
| |
| The statement definitions above generate the following C structure type:: |
| |
| typedef struct _stmt *stmt_ty; |
| |
| struct _stmt { |
| enum { FunctionDef_kind=1, Return_kind=2, Yield_kind=3 } kind; |
| union { |
| struct { |
| identifier name; |
| arguments_ty args; |
| asdl_seq *body; |
| } FunctionDef; |
| |
| struct { |
| expr_ty value; |
| } Return; |
| |
| struct { |
| expr_ty value; |
| } Yield; |
| } v; |
| int lineno; |
| } |
| |
| Also generated are a series of constructor functions that allocate (in |
| this case) a stmt_ty struct with the appropriate initialization. The |
| 'kind' field specifies which component of the union is initialized. The |
| FunctionDef() constructor function sets 'kind' to FunctionDef_kind and |
| initializes the 'name', 'args', 'body', and 'attributes' fields. |
| |
| *** NOTE: if you make a change here that can affect the output of bytecode that |
| is already in existence, make sure to delete your old .py(c|o) files! Running |
| ``find . -name '*.py[co]' -exec rm -f {} ';'`` should do the trick. |
| |
| |
| Parse Tree to AST |
| ----------------- |
| |
| The AST is generated from the parse tree in (see Python/ast.c) using the |
| function:: |
| |
| mod_ty PyAST_FromNode(const node *n); |
| |
| The function begins a tree walk of the parse tree, creating various AST |
| nodes as it goes along. It does this by allocating all new nodes it |
| needs, calling the proper AST node creation functions for any required |
| supporting functions, and connecting them as needed. |
| |
| Do realize that there is no automated nor symbolic connection between |
| the grammar specification and the nodes in the parse tree. No help is |
| directly provided by the parse tree as in yacc. |
| |
| For instance, one must keep track of |
| which node in the parse tree one is working with (e.g., if you are |
| working with an 'if' statement you need to watch out for the ':' token |
| to find the end of the conditional). No help is directly provided by |
| the parse tree as in yacc. |
| |
| The functions called to generate AST nodes from the parse tree all have |
| the name ast_for_xx where xx is what the grammar rule that the function |
| handles (alias_for_import_name is the exception to this). These in turn |
| call the constructor functions as defined by the ASDL grammar and |
| contained in Python/Python-ast.c (which was generated by |
| Parser/asdl_c.py) to create the nodes of the AST. This all leads to a |
| sequence of AST nodes stored in asdl_seq structs. |
| |
| |
| Function and macros for creating and using ``asdl_seq *`` types as found |
| in Python/asdl.c and Include/asdl.h: |
| |
| - ``asdl_seq_new(int)`` |
| Allocate memory for an asdl_seq for length 'size' |
| - ``asdl_seq_free(asdl_seq *)`` |
| Free asdl_seq struct |
| - ``asdl_seq_GET(asdl_seq *seq, int pos)`` |
| Get item held at 'pos' |
| - ``asdl_seq_SET(asdl_seq *seq, int pos, void *val)`` |
| Set 'pos' in 'seq' to 'val' |
| - ``asdl_seq_APPEND(asdl_seq *seq, void *val)`` |
| Set the end of 'seq' to 'val' |
| - ``asdl_seq_LEN(asdl_seq *)`` |
| Return the length of 'seq' |
| |
| If you are working with statements, you must also worry about keeping |
| track of what line number generated the statement. Currently the line |
| number is passed as the last parameter to each stmt_ty function. |
| |
| |
| Control Flow Graphs |
| ------------------- |
| |
| A control flow graph (often referenced by its acronym, CFG) is a |
| directed graph that models the flow of a program using basic blocks that |
| contain the intermediate representation (abbreviated "IR", and in this |
| case is Python bytecode) within the blocks. Basic blocks themselves are |
| a block of IR that has a single entry point but possibly multiple exit |
| points. The single entry point is the key to basic blocks; it all has |
| to do with jumps. An entry point is the target of something that |
| changes control flow (such as a function call or a jump) while exit |
| points are instructions that would change the flow of the program (such |
| as jumps and 'return' statements). What this means is that a basic |
| block is a chunk of code that starts at the entry point and runs to an |
| exit point or the end of the block. |
| |
| As an example, consider an 'if' statement with an 'else' block. The |
| guard on the 'if' is a basic block which is pointed to by the basic |
| block containing the code leading to the 'if' statement. The 'if' |
| statement block contains jumps (which are exit points) to the true body |
| of the 'if' and the 'else' body (which may be NULL), each of which are |
| their own basic blocks. Both of those blocks in turn point to the |
| basic block representing the code following the entire 'if' statement. |
| |
| CFGs are usually one step away from final code output. Code is directly |
| generated from the basic blocks (with jump targets adjusted based on the |
| output order) by doing a post-order depth-first search on the CFG |
| following the edges. |
| |
| |
| AST to CFG to Bytecode |
| ---------------------- |
| |
| With the AST created, the next step is to create the CFG. The first step |
| is to convert the AST to Python bytecode without having jump targets |
| resolved to specific offsets (this is calculated when the CFG goes to |
| final bytecode). Essentially, this transforms the AST into Python |
| bytecode with control flow represented by the edges of the CFG. |
| |
| Conversion is done in two passes. The first creates the namespace |
| (variables can be classified as local, free/cell for closures, or |
| global). With that done, the second pass essentially flattens the CFG |
| into a list and calculates jump offsets for final output of bytecode. |
| |
| The conversion process is initiated by a call to the function in |
| Python/newcompile.c:: |
| |
| PyCodeObject * PyAST_Compile(mod_ty, const char *, PyCompilerFlags); |
| |
| This function does both the conversion of the AST to a CFG and |
| outputting final bytecode from the CFG. The AST to CFG step is handled |
| mostly by the two functions called by PyAST_Compile():: |
| |
| struct symtable * PySymtable_Build(mod_ty, const char *, |
| PyFutureFeatures); |
| PyCodeObject * compiler_mod(struct compiler *, mod_ty); |
| |
| The former is in Python/symtable.c while the latter is in |
| Python/newcompile.c . |
| |
| PySymtable_Build() begins by entering the starting code block for the |
| AST (passed-in) and then calling the proper symtable_visit_xx function |
| (with xx being the AST node type). Next, the AST tree is walked with |
| the various code blocks that delineate the reach of a local variable |
| as blocks are entered and exited:: |
| |
| static int symtable_enter_block(struct symtable *, identifier, |
| block_ty, void *, int); |
| static int symtable_exit_block(struct symtable *, void *); |
| |
| Once the symbol table is created, it is time for CFG creation, whose |
| code is in Python/newcompile.c . This is handled by several functions |
| that break the task down by various AST node types. The functions are |
| all named compiler_visit_xx where xx is the name of the node type (such |
| as stmt, expr, etc.). Each function receives a ``struct compiler *`` |
| and xx_ty where xx is the AST node type. Typically these functions |
| consist of a large 'switch' statement, branching based on the kind of |
| node type passed to it. Simple things are handled inline in the |
| 'switch' statement with more complex transformations farmed out to other |
| functions named compiler_xx with xx being a descriptive name of what is |
| being handled. |
| |
| When transforming an arbitrary AST node, use the VISIT macro:: |
| |
| VISIT(struct compiler *, <node type>, <AST node>); |
| |
| The appropriate compiler_visit_xx function is called, based on the value |
| passed in for <node type> (so ``VISIT(c, expr, node)`` calls |
| ``compiler_visit_expr(c, node)``). The VISIT_SEQ macro is very similar, |
| but is called on AST node sequences (those values that were created as |
| arguments to a node that used the '*' modifier). There is also |
| VISIT_SLICE just for handling slices:: |
| |
| VISIT_SLICE(struct compiler *, slice_ty, expr_context_ty); |
| |
| Emission of bytecode is handled by the following macros: |
| |
| - ``ADDOP(struct compiler *c, int op)`` |
| add 'op' as an opcode |
| - ``ADDOP_I(struct compiler *c, int op, int oparg)`` |
| add 'op' with an 'oparg' argument |
| - ``ADDOP_O(struct compiler *c, int op, PyObject *type, PyObject *obj)`` |
| add 'op' with the proper argument based on the position of obj in |
| 'type', but with no handling of mangled names; used for when you |
| need to do named lookups of objects such as globals, consts, or |
| parameters where name mangling is not possible and the scope of the |
| name is known |
| - ``ADDOP_NAME(struct compiler *, int, PyObject *, PyObject *)`` |
| just like ADDOP_O, but name mangling is also handled; used for |
| attribute loading or importing based on name |
| - ``ADDOP_JABS(struct compiling *c, int op, basicblock b)`` |
| create an absolute jump to the basic block 'b' |
| - ``ADDOP_JREL(struct compiling *c, int op, basicblock b)`` |
| create a relative jump to the basic block 'b' |
| |
| Several helper functions that will emit bytecode and are named |
| compiler_xx() where xx is what the function helps with (list, boolop |
| etc.). A rather useful one is:: |
| |
| static int compiler_nameop(struct compiler *, identifier, |
| expr_context_ty); |
| |
| This function looks up the scope of a variable and, based on the |
| expression context, emits the proper opcode to load, store, or delete |
| the variable. |
| |
| As for handling the line number on which a statement is defined, is |
| handled by compiler_visit_stmt() and thus is not a worry. |
| |
| In addition to emitting bytecode based on the AST node, handling the |
| creation of basic blocks must be done. Below are the macros and |
| functions used for managing basic blocks: |
| |
| - ``NEW_BLOCK(struct compiler *)`` |
| create block and set it as current |
| - ``NEXT_BLOCK(struct compiler *)`` |
| basically NEW_BLOCK() plus jump from current block |
| - ``compiler_new_block(struct compiler *)`` |
| create a block but don't use it (used for generating jumps) |
| |
| Once the CFG is created, it must be flattened and then final emission of |
| bytecode occurs. Flattening is handled using a post-order depth-first |
| search. Once flattened, jump offsets are backpatched based on the |
| flattening and then a PyCodeObject file is created. All of this is |
| handled by calling:: |
| |
| PyCodeObject * assemble(struct compiler *, int); |
| |
| *** NOTE: if you make a change here that can affect the output of bytecode that |
| is already in existence, make sure to delete your old .py(c|o) files! Running |
| ``find . -name '*.py[co]' -exec rm -f {} ';'`` should do the trick. |
| |
| |
| Code Objects |
| ------------ |
| |
| In the end, one ends up with a PyCodeObject which is defined in |
| Include/code.h . And with that you now have executable Python bytecode! |
| |
| |
| Modified Files |
| -------------- |
| |
| + Parser/ |
| |
| - Python.asdl |
| ASDL syntax file |
| |
| - asdl.py |
| "An implementation of the Zephyr Abstract Syntax Definition |
| Language." Uses SPARK_ to parse the ASDL files. |
| |
| - asdl_c.py |
| "Generate C code from an ASDL description." Generates |
| ../Python/Python-ast.c and ../Include/Python-ast.h . |
| |
| - spark.py |
| SPARK_ parser generator |
| |
| + Python/ |
| |
| - Python-ast.c |
| Creates C structs corresponding to the ASDL types. Also |
| contains code for marshaling AST nodes (core ASDL types have |
| marshaling code in asdl.c). "File automatically generated by |
| ../Parser/asdl_c.py". |
| |
| - asdl.c |
| Contains code to handle the ASDL sequence type. Also has code |
| to handle marshalling the core ASDL types, such as number and |
| identifier. used by Python-ast.c for marshaling AST nodes. |
| |
| - ast.c |
| Converts Python's parse tree into the abstract syntax tree. |
| |
| - compile.txt |
| This file. |
| |
| - newcompile.c |
| New version of compile.c that handles the emitting of bytecode. |
| |
| - symtable.c |
| Generates symbol table from AST. |
| |
| |
| + Include/ |
| |
| - Python-ast.h |
| Contains the actual definitions of the C structs as generated by |
| ../Python/Python-ast.c . |
| "Automatically generated by ../Parser/asdl_c.py". |
| |
| - asdl.h |
| Header for the corresponding ../Python/ast.c . |
| |
| - ast.h |
| Declares PyAST_FromNode() external (from ../Python/ast.c). |
| |
| - code.h |
| Header file for ../Objects/codeobject.c; contains definition of |
| PyCodeObject. |
| |
| - symtable.h |
| Header for ../Python/symtable.c . struct symtable and |
| PySTEntryObject are defined here. |
| |
| + Objects/ |
| |
| - codeobject.c |
| Contains PyCodeObject-related code (originally in |
| ../Python/compile.c). |
| |
| |
| ToDo |
| ---- |
| *** NOTE: all bugs and patches should be filed on SF under the group |
| "AST" for easy searching. It also does not hurt to put |
| "[AST]" at the beginning of the subject line of the tracker |
| item. |
| |
| + Stdlib support |
| - AST->Python access? |
| - rewrite compiler package to mirror AST structure? |
| + Documentation |
| - flesh out this doc |
| * byte stream output |
| * explanation of how the symbol table pass works |
| * code object (PyCodeObject) |
| + Universal |
| - make sure entire test suite passes |
| - fix memory leaks |
| - make sure return types are properly checked for errors |
| - no gcc warnings |
| |
| References |
| ---------- |
| |
| .. [Aho86] Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman. |
| `Compilers: Principles, Techniques, and Tools`, |
| http://www.amazon.com/exec/obidos/tg/detail/-/0201100886/104-0162389-6419108 |
| |
| .. [Wang97] Daniel C. Wang, Andrew W. Appel, Jeff L. Korn, and Chris |
| S. Serra. `The Zephyr Abstract Syntax Description Language.`_ |
| In Proceedings of the Conference on Domain-Specific Languages, pp. |
| 213--227, 1997. |
| |
| .. _The Zephyr Abstract Syntax Description Language.: |
| http://www.cs.princeton.edu/~danwang/Papers/dsl97/dsl97.html |
| |
| .. _SPARK: http://pages.cpsc.ucalgary.ca/~aycock/spark/ |
| |