|  | ====================================== | 
|  | Kaleidoscope: Adding Debug Information | 
|  | ====================================== | 
|  |  | 
|  | .. contents:: |
|  |    :local: |
|  |  | 
|  | Chapter 9 Introduction | 
|  | ====================== | 
|  |  | 
|  | Welcome to Chapter 9 of the "`Implementing a language with | 
|  | LLVM <index.html>`_" tutorial. In chapters 1 through 8, we've built a | 
|  | decent little programming language with functions and variables. | 
|  | But what happens if something goes wrong? How do you debug your |
|  | program? |
|  |  | 
|  | Source level debugging uses formatted data that helps a debugger | 
|  | translate from binary and the state of the machine back to the | 
|  | source that the programmer wrote. In LLVM we generally use a format | 
|  | called `DWARF <http://dwarfstd.org>`_. DWARF is a compact encoding | 
|  | that represents types, source locations, and variable locations. | 
|  |  | 
|  | The short summary of this chapter is that we'll go through the | 
|  | various things you have to add to a programming language to | 
|  | support debug info, and how you translate that into DWARF. | 
|  |  | 
|  | Caveat: For now we can't debug via the JIT, so we'll need to compile |
|  | our program down to something small and standalone. As part of this |
|  | we'll make a few modifications to how the language runs and how |
|  | programs are compiled. This means that we'll have a source file |
|  | with a simple program written in Kaleidoscope rather than the |
|  | interactive JIT. To keep the number of changes small, this comes |
|  | with one limitation: we can only have one "top level" command at a |
|  | time. |
|  |  | 
|  | Here's the sample program we'll be compiling: | 
|  |  | 
|  | .. code-block:: python | 
|  |  | 
|  |    def fib(x) |
|  |      if x < 3 then |
|  |        1 |
|  |      else |
|  |        fib(x-1)+fib(x-2); |
|  |  |
|  |    fib(10) |
|  |  | 
|  |  | 
|  | Why is this a hard problem? | 
|  | =========================== | 
|  |  | 
|  | Debug information is a hard problem for a few different reasons, mostly |
|  | centered around optimized code. First, optimization makes keeping source |
|  | locations more difficult. In LLVM IR we keep the original source location |
|  | for each IR level instruction on the instruction. Optimization passes |
|  | should keep the source locations for newly created instructions, but merged |
|  | instructions only get to keep a single location, which can make the debugger |
|  | jump around when stepping through optimized programs. Secondly, optimization |
|  | can change variables so that they are optimized out, share memory with |
|  | other variables, or become difficult to track. For the purposes of this |
|  | tutorial we're going to avoid optimization (as you'll see with one of the |
|  | next sets of patches). |
|  |  | 
|  | Ahead-of-Time Compilation Mode | 
|  | ============================== | 
|  |  | 
|  | To highlight only the aspects of adding debug information to a source | 
|  | language without needing to worry about the complexities of JIT debugging | 
|  | we're going to make a few changes to Kaleidoscope to support compiling | 
|  | the IR emitted by the front end into a simple standalone program that | 
|  | you can execute, debug, and see results. | 
|  |  | 
|  | First we make our anonymous function that contains our top level | 
|  | statement be our "main": | 
|  |  | 
|  | .. code-block:: udiff | 
|  |  | 
|  |   -    auto Proto = llvm::make_unique<PrototypeAST>("", std::vector<std::string>()); |
|  |   +    auto Proto = llvm::make_unique<PrototypeAST>("main", std::vector<std::string>()); |
|  |  | 
|  | just with the simple change of giving it a name. | 
|  |  | 
|  | Then we're going to remove the code that prints the interactive prompt wherever it exists: |
|  |  | 
|  | .. code-block:: udiff | 
|  |  | 
|  |   @@ -1129,7 +1129,6 @@ static void HandleTopLevelExpression() { |
|  |    /// top ::= definition | external | expression | ';' |
|  |    static void MainLoop() { |
|  |      while (1) { |
|  |   -    fprintf(stderr, "ready> "); |
|  |        switch (CurTok) { |
|  |        case tok_eof: |
|  |          return; |
|  |   @@ -1184,7 +1183,6 @@ int main() { |
|  |      BinopPrecedence['*'] = 40; // highest. |
|  |  |
|  |      // Prime the first token. |
|  |   -  fprintf(stderr, "ready> "); |
|  |      getNextToken(); |
|  |  | 
|  | Lastly we're going to disable all of the optimization passes and the JIT so | 
|  | that the only thing that happens after we're done parsing and generating | 
|  | code is that the LLVM IR goes to standard error: | 
|  |  | 
|  | .. code-block:: udiff | 
|  |  | 
|  |   @@ -1108,17 +1108,8 @@ static void HandleExtern() { |
|  |    static void HandleTopLevelExpression() { |
|  |      // Evaluate a top-level expression into an anonymous function. |
|  |      if (auto FnAST = ParseTopLevelExpr()) { |
|  |   -    if (auto *FnIR = FnAST->codegen()) { |
|  |   -      // We're just doing this to make sure it executes. |
|  |   -      TheExecutionEngine->finalizeObject(); |
|  |   -      // JIT the function, returning a function pointer. |
|  |   -      void *FPtr = TheExecutionEngine->getPointerToFunction(FnIR); |
|  |   - |
|  |   -      // Cast it to the right type (takes no arguments, returns a double) so we |
|  |   -      // can call it as a native function. |
|  |   -      double (*FP)() = (double (*)())(intptr_t)FPtr; |
|  |   -      // Ignore the return value for this. |
|  |   -      (void)FP; |
|  |   +    if (!FnAST->codegen()) { |
|  |   +      fprintf(stderr, "Error generating code for top level expr"); |
|  |        } |
|  |      } else { |
|  |        // Skip token for error recovery. |
|  |   @@ -1439,11 +1459,11 @@ int main() { |
|  |      // target lays out data structures. |
|  |      TheModule->setDataLayout(TheExecutionEngine->getDataLayout()); |
|  |      OurFPM.add(new DataLayoutPass()); |
|  |   +#if 0 |
|  |      OurFPM.add(createBasicAliasAnalysisPass()); |
|  |      // Promote allocas to registers. |
|  |      OurFPM.add(createPromoteMemoryToRegisterPass()); |
|  |   @@ -1218,7 +1210,7 @@ int main() { |
|  |      OurFPM.add(createGVNPass()); |
|  |      // Simplify the control flow graph (deleting unreachable blocks, etc). |
|  |      OurFPM.add(createCFGSimplificationPass()); |
|  |   - |
|  |   +  #endif |
|  |      OurFPM.doInitialization(); |
|  |  |
|  |      // Set the global so the code gen can use this. |
|  |  | 
|  | This relatively small set of changes gets us to the point that we can compile |
|  | our piece of Kaleidoscope language down to an executable program via this | 
|  | command line: | 
|  |  | 
|  | .. code-block:: bash | 
|  |  | 
|  |   Kaleidoscope-Ch9 < fib.ks |& clang -x ir - |
|  |  | 
|  | which gives an a.out/a.exe in the current working directory. | 
|  |  | 
|  | Compile Unit | 
|  | ============ | 
|  |  | 
|  | The top level container for a section of code in DWARF is a compile unit. | 
|  | This contains the type and function data for an individual translation unit | 
|  | (read: one file of source code). So the first thing we need to do is | 
|  | construct one for our fib.ks file. | 
|  |  | 
|  | DWARF Emission Setup | 
|  | ==================== | 
|  |  | 
|  | Similar to the ``IRBuilder`` class we have a |
|  | `DIBuilder <http://llvm.org/doxygen/classllvm_1_1DIBuilder.html>`_ class |
|  | that helps in constructing debug metadata for an LLVM IR file. It |
|  | relates to debug metadata much as ``IRBuilder`` relates to LLVM IR, but |
|  | with nicer names. |
|  | Using it does require that you be more familiar with DWARF terminology than |
|  | you needed to be with ``IRBuilder`` and ``Instruction`` names, but if you |
|  | read through the general documentation on the |
|  | `Metadata Format <http://llvm.org/docs/SourceLevelDebugging.html>`_ it |
|  | should be a little more clear. We'll be using this class to construct all |
|  | of our IR level descriptions. Its constructor takes a module, so we |
|  | need to construct it shortly after we construct our module. We've left it |
|  | as a global static variable to make it a bit easier to use. |
|  |  | 
|  | Next we're going to create a small container to cache some of our frequent | 
|  | data. The first will be our compile unit, but we'll also write a bit of | 
|  | code for our one type since we won't have to worry about multiple typed | 
|  | expressions: | 
|  |  | 
|  | .. code-block:: c++ | 
|  |  | 
|  |   static DIBuilder *DBuilder; |
|  |  |
|  |   struct DebugInfo { |
|  |     DICompileUnit *TheCU; |
|  |     DIType *DblTy; |
|  |  |
|  |     DIType *getDoubleTy(); |
|  |   } KSDbgInfo; |
|  |  |
|  |   DIType *DebugInfo::getDoubleTy() { |
|  |     if (DblTy) |
|  |       return DblTy; |
|  |  |
|  |     DblTy = DBuilder->createBasicType("double", 64, dwarf::DW_ATE_float); |
|  |     return DblTy; |
|  |   } |
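
The caching in ``getDoubleTy`` is a plain build-on-first-use pattern. The
following LLVM-free sketch shows only that pattern. ``TypeDesc`` and
``TypeCache`` are hypothetical stand-ins rather than LLVM classes, and
``NumCreated`` exists only to make the behavior observable:

.. code-block:: c++

  #include <memory>
  #include <string>

  // Hypothetical stand-in for a debug type descriptor (not an LLVM class).
  struct TypeDesc {
    std::string Name;
    unsigned SizeInBits;
  };

  class TypeCache {
    std::unique_ptr<TypeDesc> DblTy; // empty until first requested
    unsigned NumCreated = 0;         // how many descriptors were built

  public:
    // Build the descriptor on first use, then keep returning the same one.
    const TypeDesc *getDoubleTy() {
      if (!DblTy) {
        DblTy = std::make_unique<TypeDesc>(TypeDesc{"double", 64});
        ++NumCreated;
      }
      return DblTy.get();
    }

    unsigned numCreated() const { return NumCreated; }
  };

Calling ``getDoubleTy()`` twice on the same cache returns the same pointer,
so every function and variable can refer to one shared description of
``double``.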
|  |  | 
|  | And then later on in ``main`` when we're constructing our module: | 
|  |  | 
|  | .. code-block:: c++ | 
|  |  | 
|  |   DBuilder = new DIBuilder(*TheModule); |
|  |  |
|  |   KSDbgInfo.TheCU = DBuilder->createCompileUnit( |
|  |       dwarf::DW_LANG_C, DBuilder->createFile("fib.ks", "."), |
|  |       "Kaleidoscope Compiler", 0, "", 0); |
|  |  | 
|  | There are a couple of things to note here. First, while we're producing a | 
|  | compile unit for a language called Kaleidoscope we used the language | 
|  | constant for C. This is because a debugger wouldn't necessarily understand | 
|  | the calling conventions or default ABI for a language it doesn't recognize | 
|  | and we follow the C ABI in our LLVM code generation so it's the closest | 
|  | thing to accurate. This ensures we can actually call functions from the | 
|  | debugger and have them execute. Secondly, you'll see the "fib.ks" in the | 
|  | call to ``createCompileUnit``. This is a default hard coded value since | 
|  | we're using shell redirection to put our source into the Kaleidoscope | 
|  | compiler. In a usual front end you'd have an input file name and it would | 
|  | go there. | 
|  |  | 
|  | One last thing as part of emitting debug information via DIBuilder is that | 
|  | we need to "finalize" the debug information. The reasons are part of the | 
|  | underlying API for DIBuilder, but make sure you do this near the end of | 
|  | main: | 
|  |  | 
|  | .. code-block:: c++ | 
|  |  | 
|  |   DBuilder->finalize(); |
|  |  | 
|  | before you dump out the module. | 
|  |  | 
|  | Functions | 
|  | ========= | 
|  |  | 
|  | Now that we have our ``Compile Unit``, we can add function definitions |
|  | to the debug info. So in ``FunctionAST::codegen()`` we |
|  | add a few lines of code to describe a context for our subprogram, in this |
|  | case the "File", and the actual definition of the function itself. |
|  |  | 
|  | So the context: | 
|  |  | 
|  | .. code-block:: c++ | 
|  |  | 
|  |   DIFile *Unit = DBuilder->createFile(KSDbgInfo.TheCU->getFilename(), |
|  |                                       KSDbgInfo.TheCU->getDirectory()); |
|  |  | 
|  | giving us a DIFile and asking the ``Compile Unit`` we created above for the |
|  | directory and filename where we are currently. Then, for now, we use some |
|  | source locations of 0 (since our AST doesn't currently have source location |
|  | information) and construct our function definition: |
|  |  | 
|  | .. code-block:: c++ | 
|  |  | 
|  |   DIScope *FContext = Unit; |
|  |   unsigned LineNo = 0; |
|  |   unsigned ScopeLine = 0; |
|  |   DISubprogram *SP = DBuilder->createFunction( |
|  |       FContext, P.getName(), StringRef(), Unit, LineNo, |
|  |       CreateFunctionType(TheFunction->arg_size(), Unit), |
|  |       false /* internal linkage */, true /* definition */, ScopeLine, |
|  |       DINode::FlagPrototyped, false); |
|  |   TheFunction->setSubprogram(SP); |
|  |  | 
|  | and we now have a DISubprogram that contains a reference to all of our |
|  | metadata for the function. |
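
The snippet above calls ``CreateFunctionType``, which is not shown in this
section. Since every Kaleidoscope value is a ``double``, such a helper only
needs the argument count: a subroutine type in LLVM debug info lists the
result type first, followed by one type per parameter. The following
LLVM-free sketch models just that layout. ``TypeTag`` and
``makeSignatureTypes`` are hypothetical names, and a real helper would hand
the list to ``DIBuilder`` instead of returning a plain vector:

.. code-block:: c++

  #include <vector>

  // Hypothetical stand-in for a debug type; Kaleidoscope only has double.
  enum class TypeTag { Double };

  // Element 0 is the result type; elements 1..NumArgs are the parameters.
  std::vector<TypeTag> makeSignatureTypes(unsigned NumArgs) {
    std::vector<TypeTag> Types;
    Types.reserve(NumArgs + 1);
    Types.push_back(TypeTag::Double); // result type
    for (unsigned I = 0; I != NumArgs; ++I)
      Types.push_back(TypeTag::Double); // type of parameter I
    return Types;
  }

For ``fib(x)`` this yields two entries: one for the result and one for ``x``.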
|  |  | 
|  | Source Locations | 
|  | ================ | 
|  |  | 
|  | The most important thing for debug information is accurate source location: |
|  | this makes it possible to map the generated code back to your source. We |
|  | have a problem though: Kaleidoscope really doesn't have any source location |
|  | information in the lexer or parser, so we'll need to add it. |
|  |  | 
|  | .. code-block:: c++ | 
|  |  | 
|  |   struct SourceLocation { |
|  |     int Line; |
|  |     int Col; |
|  |   }; |
|  |   static SourceLocation CurLoc; |
|  |   static SourceLocation LexLoc = {1, 0}; |
|  |  |
|  |   static int advance() { |
|  |     int LastChar = getchar(); |
|  |  |
|  |     if (LastChar == '\n' || LastChar == '\r') { |
|  |       LexLoc.Line++; |
|  |       LexLoc.Col = 0; |
|  |     } else |
|  |       LexLoc.Col++; |
|  |     return LastChar; |
|  |   } |
|  |  | 
|  | In this set of code we've added some functionality to keep track of the |
|  | line and column of the "source file". As we lex every token we set our |
|  | current "lexical location" to the line and column of the beginning |
|  | of the token. We do this by replacing all of the previous calls to |
|  | ``getchar()`` with our new ``advance()``, which keeps track of the |
|  | information. We have then added a source location to all of our AST classes: |
|  |  | 
|  | .. code-block:: c++ | 
|  |  | 
|  |   class ExprAST { |
|  |     SourceLocation Loc; |
|  |  |
|  |   public: |
|  |     ExprAST(SourceLocation Loc = CurLoc) : Loc(Loc) {} |
|  |     virtual ~ExprAST() {} |
|  |     virtual Value* codegen() = 0; |
|  |     int getLine() const { return Loc.Line; } |
|  |     int getCol() const { return Loc.Col; } |
|  |     virtual raw_ostream &dump(raw_ostream &out, int ind) { |
|  |       return out << ':' << getLine() << ':' << getCol() << '\n'; |
|  |     } |
|  |   }; |
|  |  | 
|  | that we pass down through when we create a new expression: | 
|  |  | 
|  | .. code-block:: c++ | 
|  |  | 
|  |   LHS = llvm::make_unique<BinaryExprAST>(BinLoc, BinOp, std::move(LHS), |
|  |                                          std::move(RHS)); |
|  |  | 
|  | giving us locations for each of our expressions and variables. | 
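
To see the bookkeeping in isolation, here is a minimal self-contained sketch
of the same idea that lexes from a string instead of ``stdin``.
``LocatingLexer``, ``Token``, and ``tokenize`` are hypothetical names that do
not appear in the tutorial code, the token rules are deliberately simplified
(alphanumeric runs and single characters), and only ``'\n'`` is treated as a
line break:

.. code-block:: c++

  #include <cctype>
  #include <cstddef>
  #include <string>
  #include <utility>
  #include <vector>

  struct SourceLocation {
    int Line;
    int Col;
  };

  // A token's text plus the location of its first character.
  struct Token {
    std::string Text;
    SourceLocation Loc;
  };

  class LocatingLexer {
    enum { EndOfInput = -1 };

    std::string Input;
    std::size_t Pos = 0;
    SourceLocation LexLoc{1, 0}; // location of the last character consumed

    // Consume one character and update the running line and column.
    int advance() {
      if (Pos >= Input.size())
        return EndOfInput;
      int C = static_cast<unsigned char>(Input[Pos++]);
      if (C == '\n') {
        ++LexLoc.Line;
        LexLoc.Col = 0;
      } else {
        ++LexLoc.Col;
      }
      return C;
    }

  public:
    explicit LocatingLexer(std::string Text) : Input(std::move(Text)) {}

    std::vector<Token> tokenize() {
      std::vector<Token> Tokens;
      int LastChar = advance();
      while (LastChar != EndOfInput) {
        if (std::isspace(LastChar)) {
          LastChar = advance();
          continue;
        }
        // LexLoc now describes the first character of the token.
        Token Tok{std::string(1, static_cast<char>(LastChar)), LexLoc};
        bool IsWord = std::isalnum(LastChar) != 0;
        LastChar = advance();
        // Words and numbers extend over the following alphanumerics;
        // any other character forms a one-character token.
        while (IsWord && LastChar != EndOfInput && std::isalnum(LastChar)) {
          Tok.Text += static_cast<char>(LastChar);
          LastChar = advance();
        }
        Tokens.push_back(Tok);
      }
      return Tokens;
    }
  };

Tokenizing ``def fib(x)`` this way places ``def`` at line 1, column 1 and
``fib`` at line 1, column 5. The important detail is that the location is
captured after whitespace is skipped but before the rest of the token is
consumed.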
|  |  | 
|  | To make sure that every instruction gets proper source location information, | 
|  | we have to tell ``Builder`` whenever we're at a new source location. | 
|  | We use a small helper function for this: | 
|  |  | 
|  | .. code-block:: c++ | 
|  |  | 
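|  |   void DebugInfo::emitLocation(ExprAST *AST) { |
|  |     // A null AST clears the current location (used for function prologues). |
|  |     if (!AST) |
|  |       return Builder.SetCurrentDebugLocation(DebugLoc()); |
|  |     DIScope *Scope; |
|  |     if (LexicalBlocks.empty()) |
|  |       Scope = TheCU; |
|  |     else |
|  |       Scope = LexicalBlocks.back(); |
|  |     Builder.SetCurrentDebugLocation( |
|  |         DebugLoc::get(AST->getLine(), AST->getCol(), Scope)); |
|  |   } |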
|  |  | 
|  | This tells the main ``IRBuilder`` both where we are and what scope |
|  | we're in. The scope can either be the compile unit or the nearest |
|  | enclosing lexical block, such as the current function. |
|  | To represent this we create a stack of scopes: |
|  |  | 
|  | .. code-block:: c++ | 
|  |  | 
|  |   std::vector<DIScope *> LexicalBlocks; |
|  |  | 
|  | and push the scope (function) to the top of the stack when we start | 
|  | generating the code for each function: | 
|  |  | 
|  | .. code-block:: c++ | 
|  |  | 
|  |   KSDbgInfo.LexicalBlocks.push_back(SP); |
|  |  | 
|  | Also, we must not forget to pop the scope back off of the scope stack at the |
|  | end of the code generation for the function: |
|  |  | 
|  | .. code-block:: c++ | 
|  |  | 
|  |   // Pop off the lexical block for the function since we added it |
|  |   // unconditionally. |
|  |   KSDbgInfo.LexicalBlocks.pop_back(); |
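
Because every push needs a matching pop, including on error paths, one
option is to tie the two together with a small RAII guard. This is a design
alternative and not what the tutorial code does. The sketch below is
LLVM-free: ``Scope``, ``ScopeTracker``, and ``ScopeGuard`` are hypothetical
names, with ``Scope`` standing in for ``DIScope``:

.. code-block:: c++

  #include <cassert>
  #include <string>
  #include <vector>

  // Hypothetical stand-in for a debug scope (compile unit or function).
  struct Scope {
    std::string Name;
  };

  struct ScopeTracker {
    Scope *TheCU;                       // fallback when no function is open
    std::vector<Scope *> LexicalBlocks; // innermost scope is at the back

    explicit ScopeTracker(Scope *CU) : TheCU(CU) {}

    // The scope that a newly emitted location should be attached to.
    Scope *current() const {
      return LexicalBlocks.empty() ? TheCU : LexicalBlocks.back();
    }
  };

  // Pushes a scope on construction and pops it on destruction, so an early
  // return from code generation cannot leave a stale scope behind.
  class ScopeGuard {
    ScopeTracker &Tracker;

  public:
    ScopeGuard(ScopeTracker &T, Scope *S) : Tracker(T) {
      Tracker.LexicalBlocks.push_back(S);
    }
    ~ScopeGuard() {
      assert(!Tracker.LexicalBlocks.empty() && "unbalanced scope stack");
      Tracker.LexicalBlocks.pop_back();
    }
    ScopeGuard(const ScopeGuard &) = delete;
    ScopeGuard &operator=(const ScopeGuard &) = delete;
  };

While a ``ScopeGuard`` for a function is alive, ``current()`` returns that
function's scope. Once the guard goes out of scope, ``current()`` falls back
to the compile unit again.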
|  |  | 
|  | Then we make sure to emit the location every time we start to generate code | 
|  | for a new AST object: | 
|  |  | 
|  | .. code-block:: c++ | 
|  |  | 
|  |   KSDbgInfo.emitLocation(this); |
|  |  | 
|  | Variables | 
|  | ========= | 
|  |  | 
|  | Now that we have functions, we need to be able to print out the variables | 
|  | we have in scope. Let's get our function arguments set up so we can get | 
|  | decent backtraces and see how our functions are being called. It isn't | 
|  | a lot of code, and we generally handle it when we're creating the | 
|  | argument allocas in ``FunctionAST::codegen``. | 
|  |  | 
|  | .. code-block:: c++ | 
|  |  | 
|  |   // Record the function arguments in the NamedValues map. |
|  |   NamedValues.clear(); |
|  |   unsigned ArgIdx = 0; |
|  |   for (auto &Arg : TheFunction->args()) { |
|  |     // Create an alloca for this variable. |
|  |     AllocaInst *Alloca = CreateEntryBlockAlloca(TheFunction, Arg.getName()); |
|  |  |
|  |     // Create a debug descriptor for the variable. |
|  |     DILocalVariable *D = DBuilder->createParameterVariable( |
|  |         SP, Arg.getName(), ++ArgIdx, Unit, LineNo, KSDbgInfo.getDoubleTy(), |
|  |         true); |
|  |  |
|  |     DBuilder->insertDeclare(Alloca, D, DBuilder->createExpression(), |
|  |                             DebugLoc::get(LineNo, 0, SP), |
|  |                             Builder.GetInsertBlock()); |
|  |  |
|  |     // Store the initial value into the alloca. |
|  |     Builder.CreateStore(&Arg, Alloca); |
|  |  |
|  |     // Add arguments to variable symbol table. |
|  |     NamedValues[Arg.getName()] = Alloca; |
|  |   } |
|  |  | 
|  |  | 
|  | Here we're first creating the variable, giving it the scope (``SP``), |
|  | the name, source location, type, and since it's an argument, the argument |
|  | index. Next, we create an ``llvm.dbg.declare`` call to indicate at the IR |
|  | level that we've got a variable in an alloca (and it gives a starting |
|  | location for the variable), and we set a source location for the |
|  | beginning of the scope on the declare. |
|  |  | 
|  | One interesting thing to note at this point is that various debuggers have | 
|  | assumptions based on how code and debug information was generated for them | 
|  | in the past. In this case we need to do a little bit of a hack to avoid | 
|  | generating line information for the function prologue so that the debugger | 
|  | knows to skip over those instructions when setting a breakpoint. So in | 
|  | ``FunctionAST::codegen`` we add some more lines: |
|  |  | 
|  | .. code-block:: c++ | 
|  |  | 
|  |   // Unset the location for the prologue emission (leading instructions with no |
|  |   // location in a function are considered part of the prologue and the debugger |
|  |   // will run past them when breaking on a function) |
|  |   KSDbgInfo.emitLocation(nullptr); |
|  |  | 
|  | and then emit a new location when we actually start generating code for the | 
|  | body of the function: | 
|  |  | 
|  | .. code-block:: c++ | 
|  |  | 
|  |   KSDbgInfo.emitLocation(Body.get()); |
|  |  | 
|  | With this we have enough debug information to set breakpoints in functions, | 
|  | print out argument variables, and call functions. Not too bad for just a | 
|  | few simple lines of code! | 
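
The prologue convention described above can be modeled in a few lines. This
is only an illustration of the rule "leading instructions without a location
are skipped" and not how any particular debugger is implemented.
``LineTable`` and ``firstBreakableInstruction`` are hypothetical names, and
using ``0`` to mean "no line attached" is an assumption of this sketch:

.. code-block:: c++

  #include <cstddef>
  #include <vector>

  // One entry per instruction of a function, in order. In this sketch,
  // 0 means "no source line attached".
  using LineTable = std::vector<unsigned>;

  // Index of the first instruction that carries a line, i.e. where a
  // breakpoint on the function would land under the convention above.
  // Returns Lines.size() when no instruction carries a line.
  std::size_t firstBreakableInstruction(const LineTable &Lines) {
    std::size_t I = 0;
    while (I < Lines.size() && Lines[I] == 0)
      ++I;
    return I;
  }

If the argument stores in the prologue carried line numbers, the first
breakable instruction would be index 0 and a breakpoint on the function
would stop before the arguments had been stored into their allocas.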
|  |  | 
|  | Full Code Listing | 
|  | ================= | 
|  |  | 
|  | Here is the complete code listing for our running example, enhanced with | 
|  | debug information. To build this example, use: | 
|  |  | 
|  | .. code-block:: bash | 
|  |  | 
|  |   # Compile |
|  |   clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core mcjit native` -O3 -o toy |
|  |   # Run |
|  |   ./toy |
|  |  | 
|  | Here is the code: | 
|  |  | 
|  | .. literalinclude:: ../../examples/Kaleidoscope/Chapter9/toy.cpp |
|  |    :language: c++ |
|  |  | 
|  | `Next: Conclusion and other useful LLVM tidbits <LangImpl10.html>`_ | 
|  |  |