Brett Cannon | 82a9394 | 2006-02-09 02:43:14 +0000 | [diff] [blame] | 1 | Developer Notes for Python Compiler |
| 2 | =================================== |
| 3 | |
| 4 | Table of Contents |
| 5 | ----------------- |
| 6 | |
| 7 | - Scope |
| 8 | Defines the limits of the change |
| 9 | - Parse Trees |
| 10 | Describes the local (Python) concept |
| 11 | - Abstract Syntax Trees (AST) |
| 12 | Describes the AST technology used |
| 13 | - Parse Tree to AST |
| 14 | Defines the transform approach |
| 15 | - Control Flow Graphs |
| 16 | Defines the creation of "basic blocks" |
| 17 | - AST to CFG to Bytecode |
| 18 | Tracks the flow from AST to bytecode |
| 19 | - Code Objects |
| 20 | Pointer to making bytecode "executable" |
| 21 | - Modified Files |
| 22 | Files added/modified/removed from CPython compiler |
| 23 | - ToDo |
| 24 | Work yet remaining (before complete) |
| 25 | - References |
| 26 | Academic and technical references to technology used. |
| 27 | |
| 28 | |
| 29 | Scope |
| 30 | ----- |
| 31 | |
| 32 | Historically (through 2.4), compilation from source code to bytecode |
| 33 | involved two steps: |
| 34 | |
| 35 | 1. Parse the source code into a parse tree (Parser/pgen.c) |
| 36 | 2. Emit bytecode based on the parse tree (Python/compile.c) |
| 37 | |
| 38 | Historically, this is not how a standard compiler works. The usual |
| 39 | steps for compilation are: |
| 40 | |
| 41 | 1. Parse source code into a parse tree (Parser/pgen.c) |
| 42 | 2. Transform parse tree into an Abstract Syntax Tree (Python/ast.c) |
| 43 | 3. Transform AST into a Control Flow Graph (Python/newcompile.c) |
| 44 | 4. Emit bytecode based on the Control Flow Graph (Python/newcompile.c) |
| 45 | |
| 46 | Starting with Python 2.5, the above steps are now used. This change |
| 47 | was done to simplify compilation by breaking it into three steps. |
| 48 | The purpose of this document is to outline how the lattter three steps |
| 49 | of the process works. |
| 50 | |
| 51 | This document does not touch on how parsing works beyond what is needed |
| 52 | to explain what is needed for compilation. It is also not exhaustive |
| 53 | in terms of the how the entire system works. You will most likely need |
| 54 | to read some source to have an exact understanding of all details. |
| 55 | |
| 56 | |
| 57 | Parse Trees |
| 58 | ----------- |
| 59 | |
| 60 | Python's parser is an LL(1) parser mostly based off of the |
| 61 | implementation laid out in the Dragon Book [Aho86]_. |
| 62 | |
| 63 | The grammar file for Python can be found in Grammar/Grammar with the |
| 64 | numeric value of grammar rules are stored in Include/graminit.h. The |
| 65 | numeric values for types of tokens (literal tokens, such as ``:``, |
| 66 | numbers, etc.) are kept in Include/token.h). The parse tree made up of |
| 67 | ``node *`` structs (as defined in Include/node.h). |
| 68 | |
| 69 | Querying data from the node structs can be done with the following |
| 70 | macros (which are all defined in Include/token.h): |
| 71 | |
| 72 | - ``CHILD(node *, int)`` |
| 73 | Returns the nth child of the node using zero-offset indexing |
| 74 | - ``RCHILD(node *, int)`` |
| 75 | Returns the nth child of the node from the right side; use |
| 76 | negative numbers! |
| 77 | - ``NCH(node *)`` |
| 78 | Number of children the node has |
| 79 | - ``STR(node *)`` |
| 80 | String representation of the node; e.g., will return ``:`` for a |
| 81 | COLON token |
| 82 | - ``TYPE(node *)`` |
| 83 | The type of node as specified in ``Include/graminit.h`` |
| 84 | - ``REQ(node *, TYPE)`` |
| 85 | Assert that the node is the type that is expected |
| 86 | - ``LINENO(node *)`` |
| 87 | retrieve the line number of the source code that led to the |
| 88 | creation of the parse rule; defined in Python/ast.c |
| 89 | |
| 90 | To tie all of this example, consider the rule for 'while':: |
| 91 | |
| 92 | while_stmt: 'while' test ':' suite ['else' ':' suite] |
| 93 | |
| 94 | The node representing this will have ``TYPE(node) == while_stmt`` and |
| 95 | the number of children can be 4 or 7 depending on if there is an 'else' |
| 96 | statement. To access what should be the first ':' and require it be an |
| 97 | actual ':' token, `(REQ(CHILD(node, 2), COLON)``. |
| 98 | |
| 99 | |
| 100 | Abstract Syntax Trees (AST) |
| 101 | --------------------------- |
| 102 | |
| 103 | The abstract syntax tree (AST) is a high-level representation of the |
| 104 | program structure without the necessity of containing the source code; |
| 105 | it can be thought of a abstract representation of the source code. The |
| 106 | specification of the AST nodes is specified using the Zephyr Abstract |
| 107 | Syntax Definition Language (ASDL) [Wang97]_. |
| 108 | |
| 109 | The definition of the AST nodes for Python is found in the file |
| 110 | Parser/Python.asdl . |
| 111 | |
| 112 | Each AST node (representing statements, expressions, and several |
| 113 | specialized types, like list comprehensions and exception handlers) is |
| 114 | defined by the ASDL. Most definitions in the AST correspond to a |
| 115 | particular source construct, such as an 'if' statement or an attribute |
| 116 | lookup. The definition is independent of its realization in any |
| 117 | particular programming language. |
| 118 | |
| 119 | The following fragment of the Python ASDL construct demonstrates the |
| 120 | approach and syntax:: |
| 121 | |
| 122 | module Python |
| 123 | { |
| 124 | stmt = FunctionDef(identifier name, arguments args, stmt* body, |
| 125 | expr* decorators) |
| 126 | | Return(expr? value) | Yield(expr value) |
| 127 | attributes (int lineno) |
| 128 | } |
| 129 | |
| 130 | The preceding example describes three different kinds of statements; |
| 131 | function definitions, return statements, and yield statements. All |
| 132 | three kinds are considered of type stmt as shown by '|' separating the |
| 133 | various kinds. They all take arguments of various kinds and amounts. |
| 134 | |
| 135 | Modifiers on the argument type specify the number of values needed; '?' |
| 136 | means it is optional, '*' means 0 or more, no modifier means only one |
| 137 | value for the argument and it is required. FunctionDef, for instance, |
| 138 | takes an identifier for the name, 'arguments' for args, zero or more |
| 139 | stmt arguments for 'body', and zero or more expr arguments for |
| 140 | 'decorators'. |
| 141 | |
| 142 | Do notice that something like 'arguments', which is a node type, is |
| 143 | represented as a single AST node and not as a sequence of nodes as with |
| 144 | stmt as one might expect. |
| 145 | |
| 146 | All three kinds also have an 'attributes' argument; this is shown by the |
| 147 | fact that 'attributes' lacks a '|' before it. |
| 148 | |
| 149 | The statement definitions above generate the following C structure type:: |
| 150 | |
| 151 | typedef struct _stmt *stmt_ty; |
| 152 | |
| 153 | struct _stmt { |
| 154 | enum { FunctionDef_kind=1, Return_kind=2, Yield_kind=3 } kind; |
| 155 | union { |
| 156 | struct { |
| 157 | identifier name; |
| 158 | arguments_ty args; |
| 159 | asdl_seq *body; |
| 160 | } FunctionDef; |
| 161 | |
| 162 | struct { |
| 163 | expr_ty value; |
| 164 | } Return; |
| 165 | |
| 166 | struct { |
| 167 | expr_ty value; |
| 168 | } Yield; |
| 169 | } v; |
| 170 | int lineno; |
| 171 | } |
| 172 | |
| 173 | Also generated are a series of constructor functions that allocate (in |
| 174 | this case) a stmt_ty struct with the appropriate initialization. The |
| 175 | 'kind' field specifies which component of the union is initialized. The |
| 176 | FunctionDef() constructor function sets 'kind' to FunctionDef_kind and |
| 177 | initializes the 'name', 'args', 'body', and 'attributes' fields. |
| 178 | |
| 179 | *** NOTE: if you make a change here that can affect the output of bytecode that |
| 180 | is already in existence, make sure to delete your old .py(c|o) files! Running |
| 181 | ``find . -name '*.py[co]' -exec rm -f {} ';'`` should do the trick. |
| 182 | |
| 183 | |
| 184 | Parse Tree to AST |
| 185 | ----------------- |
| 186 | |
| 187 | The AST is generated from the parse tree in (see Python/ast.c) using the |
| 188 | function:: |
| 189 | |
| 190 | mod_ty PyAST_FromNode(const node *n); |
| 191 | |
| 192 | The function begins a tree walk of the parse tree, creating various AST |
| 193 | nodes as it goes along. It does this by allocating all new nodes it |
| 194 | needs, calling the proper AST node creation functions for any required |
| 195 | supporting functions, and connecting them as needed. |
| 196 | |
| 197 | Do realize that there is no automated nor symbolic connection between |
| 198 | the grammar specification and the nodes in the parse tree. No help is |
| 199 | directly provided by the parse tree as in yacc. |
| 200 | |
| 201 | For instance, one must keep track of |
| 202 | which node in the parse tree one is working with (e.g., if you are |
| 203 | working with an 'if' statement you need to watch out for the ':' token |
| 204 | to find the end of the conditional). No help is directly provided by |
| 205 | the parse tree as in yacc. |
| 206 | |
| 207 | The functions called to generate AST nodes from the parse tree all have |
| 208 | the name ast_for_xx where xx is what the grammar rule that the function |
| 209 | handles (alias_for_import_name is the exception to this). These in turn |
| 210 | call the constructor functions as defined by the ASDL grammar and |
| 211 | contained in Python/Python-ast.c (which was generated by |
| 212 | Parser/asdl_c.py) to create the nodes of the AST. This all leads to a |
| 213 | sequence of AST nodes stored in asdl_seq structs. |
| 214 | |
| 215 | |
| 216 | Function and macros for creating and using ``asdl_seq *`` types as found |
| 217 | in Python/asdl.c and Include/asdl.h: |
| 218 | |
| 219 | - ``asdl_seq_new(int)`` |
| 220 | Allocate memory for an asdl_seq for length 'size' |
| 221 | - ``asdl_seq_free(asdl_seq *)`` |
| 222 | Free asdl_seq struct |
| 223 | - ``asdl_seq_GET(asdl_seq *seq, int pos)`` |
| 224 | Get item held at 'pos' |
| 225 | - ``asdl_seq_SET(asdl_seq *seq, int pos, void *val)`` |
| 226 | Set 'pos' in 'seq' to 'val' |
| 227 | - ``asdl_seq_APPEND(asdl_seq *seq, void *val)`` |
| 228 | Set the end of 'seq' to 'val' |
| 229 | - ``asdl_seq_LEN(asdl_seq *)`` |
| 230 | Return the length of 'seq' |
| 231 | |
| 232 | If you are working with statements, you must also worry about keeping |
| 233 | track of what line number generated the statement. Currently the line |
| 234 | number is passed as the last parameter to each stmt_ty function. |
| 235 | |
| 236 | |
| 237 | Control Flow Graphs |
| 238 | ------------------- |
| 239 | |
| 240 | A control flow graph (often referenced by its acronym, CFG) is a |
| 241 | directed graph that models the flow of a program using basic blocks that |
| 242 | contain the intermediate representation (abbreviated "IR", and in this |
| 243 | case is Python bytecode) within the blocks. Basic blocks themselves are |
| 244 | a block of IR that has a single entry point but possibly multiple exit |
| 245 | points. The single entry point is the key to basic blocks; it all has |
| 246 | to do with jumps. An entry point is the target of something that |
| 247 | changes control flow (such as a function call or a jump) while exit |
| 248 | points are instructions that would change the flow of the program (such |
| 249 | as jumps and 'return' statements). What this means is that a basic |
| 250 | block is a chunk of code that starts at the entry point and runs to an |
| 251 | exit point or the end of the block. |
| 252 | |
| 253 | As an example, consider an 'if' statement with an 'else' block. The |
| 254 | guard on the 'if' is a basic block which is pointed to by the basic |
| 255 | block containing the code leading to the 'if' statement. The 'if' |
| 256 | statement block contains jumps (which are exit points) to the true body |
| 257 | of the 'if' and the 'else' body (which may be NULL), each of which are |
| 258 | their own basic blocks. Both of those blocks in turn point to the |
| 259 | basic block representing the code following the entire 'if' statement. |
| 260 | |
| 261 | CFGs are usually one step away from final code output. Code is directly |
| 262 | generated from the basic blocks (with jump targets adjusted based on the |
| 263 | output order) by doing a post-order depth-first search on the CFG |
| 264 | following the edges. |
| 265 | |
| 266 | |
| 267 | AST to CFG to Bytecode |
| 268 | ---------------------- |
| 269 | |
| 270 | With the AST created, the next step is to create the CFG. The first step |
| 271 | is to convert the AST to Python bytecode without having jump targets |
| 272 | resolved to specific offsets (this is calculated when the CFG goes to |
| 273 | final bytecode). Essentially, this transforms the AST into Python |
| 274 | bytecode with control flow represented by the edges of the CFG. |
| 275 | |
| 276 | Conversion is done in two passes. The first creates the namespace |
| 277 | (variables can be classified as local, free/cell for closures, or |
| 278 | global). With that done, the second pass essentially flattens the CFG |
| 279 | into a list and calculates jump offsets for final output of bytecode. |
| 280 | |
| 281 | The conversion process is initiated by a call to the function in |
| 282 | Python/newcompile.c:: |
| 283 | |
| 284 | PyCodeObject * PyAST_Compile(mod_ty, const char *, PyCompilerFlags); |
| 285 | |
| 286 | This function does both the conversion of the AST to a CFG and |
| 287 | outputting final bytecode from the CFG. The AST to CFG step is handled |
| 288 | mostly by the two functions called by PyAST_Compile():: |
| 289 | |
| 290 | struct symtable * PySymtable_Build(mod_ty, const char *, |
| 291 | PyFutureFeatures); |
| 292 | PyCodeObject * compiler_mod(struct compiler *, mod_ty); |
| 293 | |
| 294 | The former is in Python/symtable.c while the latter is in |
| 295 | Python/newcompile.c . |
| 296 | |
| 297 | PySymtable_Build() begins by entering the starting code block for the |
| 298 | AST (passed-in) and then calling the proper symtable_visit_xx function |
| 299 | (with xx being the AST node type). Next, the AST tree is walked with |
| 300 | the various code blocks that delineate the reach of a local variable |
| 301 | as blocks are entered and exited:: |
| 302 | |
| 303 | static int symtable_enter_block(struct symtable *, identifier, |
| 304 | block_ty, void *, int); |
| 305 | static int symtable_exit_block(struct symtable *, void *); |
| 306 | |
| 307 | Once the symbol table is created, it is time for CFG creation, whose |
| 308 | code is in Python/newcompile.c . This is handled by several functions |
| 309 | that break the task down by various AST node types. The functions are |
| 310 | all named compiler_visit_xx where xx is the name of the node type (such |
| 311 | as stmt, expr, etc.). Each function receives a ``struct compiler *`` |
| 312 | and xx_ty where xx is the AST node type. Typically these functions |
| 313 | consist of a large 'switch' statement, branching based on the kind of |
| 314 | node type passed to it. Simple things are handled inline in the |
| 315 | 'switch' statement with more complex transformations farmed out to other |
| 316 | functions named compiler_xx with xx being a descriptive name of what is |
| 317 | being handled. |
| 318 | |
| 319 | When transforming an arbitrary AST node, use the VISIT macro:: |
| 320 | |
| 321 | VISIT(struct compiler *, <node type>, <AST node>); |
| 322 | |
| 323 | The appropriate compiler_visit_xx function is called, based on the value |
| 324 | passed in for <node type> (so ``VISIT(c, expr, node)`` calls |
| 325 | ``compiler_visit_expr(c, node)``). The VISIT_SEQ macro is very similar, |
| 326 | but is called on AST node sequences (those values that were created as |
| 327 | arguments to a node that used the '*' modifier). There is also |
| 328 | VISIT_SLICE just for handling slices:: |
| 329 | |
| 330 | VISIT_SLICE(struct compiler *, slice_ty, expr_context_ty); |
| 331 | |
| 332 | Emission of bytecode is handled by the following macros: |
| 333 | |
| 334 | - ``ADDOP(struct compiler *c, int op)`` |
| 335 | add 'op' as an opcode |
| 336 | - ``ADDOP_I(struct compiler *c, int op, int oparg)`` |
| 337 | add 'op' with an 'oparg' argument |
| 338 | - ``ADDOP_O(struct compiler *c, int op, PyObject *type, PyObject *obj)`` |
| 339 | add 'op' with the proper argument based on the position of obj in |
| 340 | 'type', but with no handling of mangled names; used for when you |
| 341 | need to do named lookups of objects such as globals, consts, or |
| 342 | parameters where name mangling is not possible and the scope of the |
| 343 | name is known |
| 344 | - ``ADDOP_NAME(struct compiler *, int, PyObject *, PyObject *)`` |
| 345 | just like ADDOP_O, but name mangling is also handled; used for |
| 346 | attribute loading or importing based on name |
| 347 | - ``ADDOP_JABS(struct compiling *c, int op, basicblock b)`` |
| 348 | create an absolute jump to the basic block 'b' |
| 349 | - ``ADDOP_JREL(struct compiling *c, int op, basicblock b)`` |
| 350 | create a relative jump to the basic block 'b' |
| 351 | |
| 352 | Several helper functions that will emit bytecode and are named |
| 353 | compiler_xx() where xx is what the function helps with (list, boolop |
| 354 | etc.). A rather useful one is:: |
| 355 | |
| 356 | static int compiler_nameop(struct compiler *, identifier, |
| 357 | expr_context_ty); |
| 358 | |
| 359 | This function looks up the scope of a variable and, based on the |
| 360 | expression context, emits the proper opcode to load, store, or delete |
| 361 | the variable. |
| 362 | |
| 363 | As for handling the line number on which a statement is defined, is |
| 364 | handled by compiler_visit_stmt() and thus is not a worry. |
| 365 | |
| 366 | In addition to emitting bytecode based on the AST node, handling the |
| 367 | creation of basic blocks must be done. Below are the macros and |
| 368 | functions used for managing basic blocks: |
| 369 | |
| 370 | - ``NEW_BLOCK(struct compiler *)`` |
| 371 | create block and set it as current |
| 372 | - ``NEXT_BLOCK(struct compiler *)`` |
| 373 | basically NEW_BLOCK() plus jump from current block |
| 374 | - ``compiler_new_block(struct compiler *)`` |
| 375 | create a block but don't use it (used for generating jumps) |
| 376 | |
| 377 | Once the CFG is created, it must be flattened and then final emission of |
| 378 | bytecode occurs. Flattening is handled using a post-order depth-first |
| 379 | search. Once flattened, jump offsets are backpatched based on the |
| 380 | flattening and then a PyCodeObject file is created. All of this is |
| 381 | handled by calling:: |
| 382 | |
| 383 | PyCodeObject * assemble(struct compiler *, int); |
| 384 | |
| 385 | *** NOTE: if you make a change here that can affect the output of bytecode that |
| 386 | is already in existence, make sure to delete your old .py(c|o) files! Running |
| 387 | ``find . -name '*.py[co]' -exec rm -f {} ';'`` should do the trick. |
| 388 | |
| 389 | |
| 390 | Code Objects |
| 391 | ------------ |
| 392 | |
| 393 | In the end, one ends up with a PyCodeObject which is defined in |
| 394 | Include/code.h . And with that you now have executable Python bytecode! |
| 395 | |
| 396 | |
| 397 | Modified Files |
| 398 | -------------- |
| 399 | |
| 400 | + Parser/ |
| 401 | |
| 402 | - Python.asdl |
| 403 | ASDL syntax file |
| 404 | |
| 405 | - asdl.py |
| 406 | "An implementation of the Zephyr Abstract Syntax Definition |
| 407 | Language." Uses SPARK_ to parse the ASDL files. |
| 408 | |
| 409 | - asdl_c.py |
| 410 | "Generate C code from an ASDL description." Generates |
| 411 | ../Python/Python-ast.c and ../Include/Python-ast.h . |
| 412 | |
| 413 | - spark.py |
| 414 | SPARK_ parser generator |
| 415 | |
| 416 | + Python/ |
| 417 | |
| 418 | - Python-ast.c |
| 419 | Creates C structs corresponding to the ASDL types. Also |
| 420 | contains code for marshaling AST nodes (core ASDL types have |
| 421 | marshaling code in asdl.c). "File automatically generated by |
| 422 | ../Parser/asdl_c.py". |
| 423 | |
| 424 | - asdl.c |
| 425 | Contains code to handle the ASDL sequence type. Also has code |
| 426 | to handle marshalling the core ASDL types, such as number and |
| 427 | identifier. used by Python-ast.c for marshaling AST nodes. |
| 428 | |
| 429 | - ast.c |
| 430 | Converts Python's parse tree into the abstract syntax tree. |
| 431 | |
| 432 | - compile.txt |
| 433 | This file. |
| 434 | |
| 435 | - newcompile.c |
| 436 | New version of compile.c that handles the emitting of bytecode. |
| 437 | |
| 438 | - symtable.c |
| 439 | Generates symbol table from AST. |
| 440 | |
| 441 | |
| 442 | + Include/ |
| 443 | |
| 444 | - Python-ast.h |
| 445 | Contains the actual definitions of the C structs as generated by |
| 446 | ../Python/Python-ast.c . |
| 447 | "Automatically generated by ../Parser/asdl_c.py". |
| 448 | |
| 449 | - asdl.h |
| 450 | Header for the corresponding ../Python/ast.c . |
| 451 | |
| 452 | - ast.h |
| 453 | Declares PyAST_FromNode() external (from ../Python/ast.c). |
| 454 | |
| 455 | - code.h |
| 456 | Header file for ../Objects/codeobject.c; contains definition of |
| 457 | PyCodeObject. |
| 458 | |
| 459 | - symtable.h |
| 460 | Header for ../Python/symtable.c . struct symtable and |
| 461 | PySTEntryObject are defined here. |
| 462 | |
| 463 | + Objects/ |
| 464 | |
| 465 | - codeobject.c |
| 466 | Contains PyCodeObject-related code (originally in |
| 467 | ../Python/compile.c). |
| 468 | |
| 469 | |
| 470 | ToDo |
| 471 | ---- |
| 472 | *** NOTE: all bugs and patches should be filed on SF under the group |
| 473 | "AST" for easy searching. It also does not hurt to put |
| 474 | "[AST]" at the beginning of the subject line of the tracker |
| 475 | item. |
| 476 | |
| 477 | + Stdlib support |
| 478 | - AST->Python access? |
| 479 | - rewrite compiler package to mirror AST structure? |
| 480 | + Documentation |
| 481 | - flesh out this doc |
| 482 | * byte stream output |
| 483 | * explanation of how the symbol table pass works |
| 484 | * code object (PyCodeObject) |
| 485 | + Universal |
| 486 | - make sure entire test suite passes |
| 487 | - fix memory leaks |
| 488 | - make sure return types are properly checked for errors |
| 489 | - no gcc warnings |
| 490 | |
| 491 | References |
| 492 | ---------- |
| 493 | |
| 494 | .. [Aho86] Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman. |
| 495 | `Compilers: Principles, Techniques, and Tools`, |
| 496 | http://www.amazon.com/exec/obidos/tg/detail/-/0201100886/104-0162389-6419108 |
| 497 | |
| 498 | .. [Wang97] Daniel C. Wang, Andrew W. Appel, Jeff L. Korn, and Chris |
| 499 | S. Serra. `The Zephyr Abstract Syntax Description Language.`_ |
| 500 | In Proceedings of the Conference on Domain-Specific Languages, pp. |
| 501 | 213--227, 1997. |
| 502 | |
| 503 | .. _The Zephyr Abstract Syntax Description Language.: |
| 504 | http://www.cs.princeton.edu/~danwang/Papers/dsl97/dsl97.html |
| 505 | |
| 506 | .. _SPARK: http://pages.cpsc.ucalgary.ca/~aycock/spark/ |
| 507 | |