| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" |
| "http://www.w3.org/TR/html4/strict.dtd"> |
| <html> |
| <head> |
| <title>Tutorial for building tools using LibTooling and LibASTMatchers</title> |
| <link type="text/css" rel="stylesheet" href="../menu.css" /> |
| <link type="text/css" rel="stylesheet" href="../content.css" /> |
| </head> |
| <body> |
| |
| <!--#include virtual="../menu.html.incl"--> |
| |
| <div id="content"> |
| |
| <h1>Tutorial for building tools using LibTooling and LibASTMatchers</h1> |
| <p>This document is intended to show how to build a useful source-to-source |
| translation tool based on Clang's <a href="LibTooling.html">LibTooling</a>. It |
| is explicitly aimed at people who are new to Clang, so all you should need is a |
| working knowledge of C++ and the command line.</p> |
| |
| <p>In order to work on the compiler, you need some basic knowledge of the |
| abstract syntax tree (AST). To this end, the reader is incouraged to skim the |
| <a href="http://clang.llvm.org/docs/IntroductionToTheClangAST.html">Introduction |
| to the Clang AST</a></p> |
| |
| <!-- ======================================================================= --> |
| <h2 id="obtainingclang">Step 0: Obtaining Clang</h2> |
| <!-- ======================================================================= --> |
| As Clang is part of the LLVM project, you'll need to download LLVM's source code |
| first. Both Clang and LLVM are maintained as Subversion repositories, but we'll |
| be accessing them through the git mirror. For further information, see the |
| <a href="http://llvm.org/docs/GettingStarted.html">getting started guide</a>. |
| |
| <pre class="doc_code"> |
| mkdir ~/clang-llvm && cd ~/clang-llvm |
| git clone http://llvm.org/git/llvm.git |
| cd llvm/tools |
| git clone http://llvm.org/git/clang.git |
| </pre> |
| |
| Next you need to obtain the CMake build system and Ninja build tool. You may |
| already have CMake installed, but current binary versions of CMake aren't built |
| with Ninja support. |
| |
| <pre class="doc_code"> |
| cd ~/clang-llvm |
| git clone https://github.com/martine/ninja.git |
| cd ninja |
| git checkout release |
| ./bootstrap.py |
| sudo cp ninja /usr/bin/ |
| |
| cd ~/clang-llvm |
| git clone git://cmake.org/stage/cmake.git |
| cd cmake |
| git checkout next |
| ./bootstrap |
| make |
| sudo make install |
| </pre> |
| |
| <p>Okay. Now we'll build Clang!</p> |
| |
| <pre class="doc_code"> |
| cd ~/clang-llvm |
| mkdir build && cd build |
| cmake -G Ninja ../llvm -DLLVM_BUILD_TESTS=ON # Enable tests; default is off. |
| ninja |
| ninja check # Test LLVM only. |
| ninja clang-test # Test Clang only. |
| ninja install |
| </pre> |
| |
| <p>And we're live.</p> |
| |
| <p>All of the tests should pass, though there is a (very) small chance that you |
| can catch LLVM and Clang out of sync. Running <tt>'git svn rebase'</tt> in both |
| the llvm and clang directories should fix any problems.</p> |
| |
| <p>Finally, we want to set Clang as its own compiler.</p> |
| |
| <pre class="doc_code"> |
| cd ~/clang-llvm/build |
| ccmake ../llvm |
| </pre> |
| |
| <p>The second command will bring up a GUI for configuring Clang. You need to set |
| the entry for <tt>CMAKE_CXX_COMPILER</tt>. Press <tt>'t'</tt> to turn on |
| advanced mode. Scroll down to <tt>CMAKE_CXX_COMPILER</tt>, and set it to |
| <tt>/usr/bin/clang++</tt>, or wherever you installed it. Press <tt>'c'</tt> to |
| configure, then <tt>'g'</tt> to generate CMake's files.</p> |
| |
| <p>Finally, run ninja one last time, and you're done.</p> |
| |
| <!-- ======================================================================= --> |
| <h2 id="clangtool">Step 1: Create a ClangTool</h2> |
| <!-- ======================================================================= --> |
| <p>Now that we have enough background knowledge, it's time to create the |
| simplest productive ClangTool in existence: a syntax checker. While this already |
| exists as <tt>clang-check</tt>, it's important to understand what's going |
| on.</p> |
| |
| <p>First, we'll need to create a new directory for our tool and tell CMake that |
| it exists. As this is not going to be a core clang tool, it will live in the |
| <tt>tools/extra</tt> repository.</p> |
| |
| <pre class="doc_code"> |
| cd ~/clang-llvm/llvm/tools/clang |
| mkdir tools/extra/loop-convert |
| echo 'add_subdirectory(loop-convert)' >> tools/extra/CMakeLists.txt |
| vim tools/extra/loop-convert/CMakeLists.txt |
| </pre> |
| |
| CMakeLists.txt should have the following contents: |
| <pre class="doc_code"> |
| set(LLVM_LINK_COMPONENTS support) |
| set(LLVM_USED_LIBS clangTooling clangBasic clangAST) |
| |
| add_clang_executable(loop-convert |
| LoopConvert.cpp |
| ) |
| target_link_libraries(loop-convert |
| clangTooling |
| clangBasic |
| clangASTMatchers |
| ) |
| </pre> |
| |
| <p>With that done, Ninja will be able to compile our tool. Let's give it |
| something to compile! Put the following into |
| <tt>tools/extra/loop-convert/LoopConvert.cpp</tt>. A detailed explanation of why |
| the different parts are needed can be found in the |
| <a href="LibTooling.html">LibTooling documentation</a>.</p> |
| |
| <pre> |
| // Declares clang::SyntaxOnlyAction. |
| #include "clang/Frontend/FrontendActions.h" |
| #include "clang/Tooling/CommonOptionsParser.h" |
| #include "clang/Tooling/Tooling.h" |
| // Declares llvm::cl::extrahelp. |
| #include "llvm/Support/CommandLine.h" |
| |
| using namespace clang::tooling; |
| using namespace llvm; |
| |
| // CommonOptionsParser declares HelpMessage with a description of the common |
| // command-line options related to the compilation database and input files. |
| // It's nice to have this help message in all tools. |
| static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage); |
| |
| // A help message for this specific tool can be added afterwards. |
| static cl::extrahelp MoreHelp("\nMore help text..."); |
| |
| int main(int argc, const char **argv) { |
| CommonOptionsParser OptionsParser(argc, argv); |
| ClangTool Tool(OptionsParser.GetCompilations(), |
| OptionsParser.GetSourcePathList()); |
| return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>()); |
| } |
| </pre> |
| |
| <p>And that's it! You can compile our new tool by running ninja from the |
| <tt>build</tt> directory.</p> |
| |
| <pre class="doc_code"> |
| cd ~/clang-llvm/build |
| ninja |
| </pre> |
| |
| <p>You should now be able to run the syntax checker, which is located in |
| <tt>~/clang-llvm/build/bin</tt>, on any source file. Try it!</p> |
| |
| <pre class="doc_code"> |
| cat "void main() {}" > test.cpp |
| bin/loop-convert test.cpp -- |
| </pre> |
| |
| <p>Note the two dashes after we specify the source file. The additional options |
| for the compiler are passed after the dashes rather than loading them from a |
| compilation database - there just aren't any options needed right now.</p> |
| |
| <!-- ======================================================================= --> |
| <h2 id="learnastmatchers">Intermezzo: Learn AST matcher basics</h2> |
| <!-- ======================================================================= --> |
| <p>Clang recently introduced the <a href=LibASTMatchers.html>ASTMatcher |
| library</a> to provide a simple, powerful, and concise way to describe specific |
| patterns in the AST. Implemented as a DSL powered by macros and templates (see |
| <a href="../doxygen/ASTMatchers_8h_source.html">ASTMatchers.h</a> if you're |
| curious), matchers offer the feel of algebraic data types common to functional |
| programming languages.</p> |
| |
| <p>For example, suppose you wanted to examine only binary operators. There is a |
| matcher to do exactly that, conveniently named <tt>binaryOperator</tt>. I'll |
| give you one guess what this matcher does:</p> |
| |
| <pre class="doc_code"> |
| binaryOperator(hasOperatorName("+"), hasLHS(integerLiteral(equals(0)))) |
| </pre> |
| |
| <p>Shockingly, it will match against addition expressions whose left hand side |
| is exactly the literal 0. It will not match against other forms of 0, such as |
| <tt>'\0'</tt> or <tt>NULL</tt>, but it will match against macros that expand to |
| 0. The matcher will also not match against calls to the overloaded operator |
| <tt>'+'</tt>, as there is a separate <tt>operatorCallExpr</tt> matcher to handle |
| overloaded operators.</p> |
| |
| <p>There are AST matchers to match all the different nodes of the AST, narrowing |
| matchers to only match AST nodes fulfilling specific criteria, and traversal |
| matchers to get from one kind of AST node to another. For a complete list of AST |
| matchers, take a look at the <a href="LibASTMatchersReference.html">AST Matcher |
| References</a></p> |
| |
| <p>All matcher that are nouns describe entities in the AST and can be bound, |
| so that they can be referred to whenever a match is found. To do so, simply call |
| the method <tt>bind</tt> on these matchers, e.g.:</p> |
| <pre class="doc_code"> |
| variable(hasType(isInteger())).bind("intvar") |
| </pre> |
| |
| <!-- ======================================================================= --> |
| <h2 id="usingastmatchers">Step 2: Using AST matchers</h2> |
| <!-- ======================================================================= --> |
| <p>Okay, on to using matchers for real. Let's start by defining a matcher which |
| will capture all <tt>for</tt> statements that define a new variable |
| initialized to zero. Let's start with matching all <tt>for</tt> loops:</p> |
| |
| <pre class="doc_code"> |
| forStmt() |
| </pre> |
| |
| <p>Next, we want to specify that a single variable is declared in the first |
| portion of the loop, so we can extend the matcher to</p> |
| <pre class="doc_code"> |
| forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl())))) |
| </pre> |
| |
| <p>Finally, we can add the condition that the variable is initialized to |
| zero.</p> |
| <pre class="doc_code"> |
| forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( |
| hasInitializer(integerLiteral(equals(0)))))))) |
| </pre> |
| |
| </p>It is fairly easy to read and understand the matcher definition ("match |
| loops whose init portion declares a single variable which is initialized to the |
| integer literal 0"), but deciding that every piece is necessary is more |
| difficult. Note that this matcher will not match loops whose variables are |
| initialized to <tt>'\0'</tt>, <tt>0.0</tt>, <tt>NULL</tt>, or any form of zero |
| besides the integer 0.</p> |
| |
| <p>The last step is giving the matcher a name and binding the <tt>ForStmt</tt> |
| as we will want to do something with it:</p> |
| <pre class="doc_code"> |
| StatementMatcher LoopMatcher = |
| forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( |
| hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop"); |
| </pre> |
| |
| <p>Once you have defined your matchers, you will need to add a little more |
| scaffolding in order to run them. Matchers are paired with a |
| <tt>MatchCallback</tt> and registered with a <tt>MatchFinder</tt> object, then |
| run from a <tt>ClangTool</tt>. More code!</p> |
| |
| Add the following to <tt>LoopConvert.cpp</tt>: |
| <pre class="doc_code"> |
| StatementMatcher LoopMatcher = |
| forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( |
| hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop"); |
| |
| class LoopPrinter : public MatchFinder::MatchCallback { |
| public : |
| virtual void run(const MatchFinder::MatchResult &Result) { |
| if (const ForStmt *FS = Result.Nodes.getNodeAs<clang::ForStmt>("forLoop")) |
| FS->dump(); |
| }; |
| </pre> |
| |
| And change <tt>main()</tt> to: |
| <pre class="doc_code"> |
| int main(int argc, const char **argv) { |
| CommonOptionsParser OptionsParser(argc, argv); |
| ClangTool Tool(OptionsParser.GetCompilations(), |
| OptionsParser.GetSourcePathList()); |
| |
| LoopPrinter Printer; |
| MatchFinder Finder; |
| Finder.addMatcher(LoopMatcher, &Printer); |
| |
| return Tool.run(newFrontendActionFactory(&Finder)); |
| } |
| </pre> |
| |
| <p>Now, you should be able to recompile and run the code to discover for loops. |
| Create a new file with a few examples, and test out our new handiwork:</p> |
| |
| <pre class="doc_code"> |
| cd ~/clang-llvm/llvm/llvm_build/ |
| ninja loop-convert |
| vim ~/test-files/simple-loops.cc |
| bin/loop-convert ~/test-files/simple-loops.cc |
| </pre> |
| |
| <!-- FIXME: add example step-2a --> |
| |
| <!-- ======================================================================= --> |
| <h2 id="morematchers">Step 3.5: More Complicated Matchers</h2> |
| <!-- ======================================================================= --> |
| <p>Our simple matcher is capable of discovering for loops, but we would still |
| need to filter out many more ourselves. We can do a good portion of the |
| remaining work with some cleverly chosen matchers, but first we need to decide |
| exactly which properties we want to allow.</p> |
| |
| <p>How can we characterize for loops over arrays which would be eligible for |
| translation to range-based syntax? Range based loops over arrays of size |
| <tt>N</tt> that:</p> |
| <ul> |
| <li>start at index <tt>0</tt></li> |
| <li>iterate consecutively</li> |
| <li>end at index <tt>N-1</tt></li> |
| </ul> |
| |
| <p>We already check for (1), so all we need to add is a check to the loop's |
| condition to ensure that the loop's index variable is compared against |
| <tt>N</tt> and another check to ensure that the increment step just increments |
| this same variable. The matcher for (2) is straightforward: require a pre- or |
| post-increment of the same variable declared in the init portion.</p> |
| |
| <p>Unfortunately, such a matcher is impossible to write. Matchers contain no |
| logic for comparing two arbitrary AST nodes and determining whether or not they |
| are equal, so the best we can do is matching more than we would like to allow, |
| and punting extra comparisons to the callback.</p> |
| |
| <p>In any case, we can start building this sub-matcher. We can require that the |
| increment step be a unary increment like this:</p> |
| |
| <pre class="doc_code"> |
| hasIncrement(unaryOperator(hasOperatorName("++"))) |
| </pre> |
| |
| <p>Specifying what is incremented introduces another quirk of Clang's AST: |
| Usages of variables are represented as <tt>DeclRefExpr</tt>'s ("declaration |
| reference expressions") because they are expressions which refer to variable |
| declarations. To find a <tt>unaryOperator</tt> that refers to a specific |
| declaration, we can simply add a second condition to it:</p> |
| <pre class="doc_code"> |
| hasIncrement(unaryOperator( |
| hasOperatorName("++"), |
| hasUnaryOperand(declRefExpr()))) |
| </pre> |
| |
| <p>Furthermore, we can restrict our matcher to only match if the incremented |
| variable is an integer:</p> |
| <pre class="doc_code"> |
| hasIncrement(unaryOperator( |
| hasOperatorName("++"), |
| hasUnaryOperand(declRefExpr(to(varDecl(hasType(isInteger()))))))) |
| </pre> |
| |
| </p>And the last step will be to attach an identifier to this variable, so that |
| we can retrieve it in the callback:</p> |
| <pre class="doc_code"> |
| hasIncrement(unaryOperator( |
| hasOperatorName("++"), |
| hasUnaryOperand(declRefExpr(to( |
| varDecl(hasType(isInteger())).bind("incrementVariable")))))) |
| </pre> |
| |
| <p>We can add this code to the definition of <tt>LoopMatcher</tt> and make sure |
| that our program, outfitted with the new matcher, only prints out loops that |
| declare a single variable initialized to zero and have an increment step |
| consisting of a unary increment of some variable.</p> |
| |
| <!-- FIXME: add example step-2b --> |
| |
| <p>Now, we just need to add a matcher to check if the condition part of the |
| <tt>for</tt> loop compares a variable against the size of the array. There is |
| only one problem - we don't know which array we're iterating over without |
| looking at the body of the loop! We are again restricted to approximating the |
| result we want with matchers, filling in the details in the callback. So we |
| start with:</p> |
| <pre class="doc_code"> |
| hasCondition(binaryOperator(hasOperatorName("<")) |
| </pre> |
| |
| <p>It makes sense to ensure that the left-hand side is a reference to a |
| variable, and that the right-hand side has integer type.</p> |
| <pre class="doc_code"> |
| hasCondition(binaryOperator( |
| hasOperatorName("<"), |
| hasRHS(expr(hasType(isInteger()))), |
| hasLHS(declRefExpr(to(varDecl(hasType(isInteger()))))))) |
| </pre> |
| |
| <!-- FIXME: add example step-2c --> |
| |
| <p>Why? Because it doesn't work. Of the three loops provided in |
| <tt>test-files/simple.cpp</tt>, zero of them have a matching condition. A quick |
| look at the AST dump of the first for loop, produced by the previous iteration |
| of loop-convert, shows us the answer:</p> |
| |
| <pre class="doc_code"> |
| (ForStmt 0x173b240 |
| (DeclStmt 0x173afc8 |
| 0x173af50 "int i = |
| (IntegerLiteral 0x173afa8 'int' 0)") |
| <<<NULL>>> |
| (BinaryOperator 0x173b060 '_Bool' '<' |
| (ImplicitCastExpr 0x173b030 'int' <LValueToRValue> |
| (DeclRefExpr 0x173afe0 'int' lvalue Var 0x173af50 'i' 'int')) |
| (ImplicitCastExpr 0x173b048 'int' <LValueToRValue> |
| (DeclRefExpr 0x173b008 'const int' lvalue Var 0x170fa80 'N' 'const int'))) |
| (UnaryOperator 0x173b0b0 'int' lvalue prefix '++' |
| (DeclRefExpr 0x173b088 'int' lvalue Var 0x173af50 'i' 'int')) |
| (CompoundStatement … |
| </pre> |
| |
| <p>We already know that the declaration and increments both match, or this loop |
| wouldn't have been dumped. The culprit lies in the implicit cast applied to the |
| first operand (i.e. the LHS) of the less-than operator, an L-value to R-value |
| conversion applied to the expression referencing <tt>i</tt>. Thankfully, the |
| matcher library offers a solution to this problem in the form of |
| <tt>ignoringParenImpCasts</tt>, which instructs the matcher to ignore implicit |
| casts and parentheses before continuing to match. Adjusting the condition |
| operator will restore the desired match.</p> |
| |
| <pre class="doc_code"> |
| hasCondition(binaryOperator( |
| hasOperatorName("<"), |
| hasLHS(expr(hasType(isInteger()))), |
| hasRHS(ignoringParenImpCasts(declRefExpr( |
| to(varDecl(hasType(isInteger())))))))) |
| </pre> |
| |
| <p>After adding binds to the expressions we wished to capture and extracting the |
| identifier strings into variables, we have array-step-2 completed.</p> |
| |
| <!-- ======================================================================= --> |
| <h2 id="usingastmatchers">Step 4: Retrieving Matched Nodes</h2> |
| <!-- ======================================================================= --> |
| <p>So far, the matcher callback isn't very interesting: it just dumps the loop's |
| AST. At some point, we will need to make changes to the input source code. Next, |
| we'll work on using the nodes we bound in the previous step.</p> |
| |
| <p>The <tt>MatchFinder::run()</tt> callback takes a |
| <tt>MatchFinder::MatchResult&</tt> as its parameter. We're most interested in |
| its <tt>Context</tt> and <tt>Nodes</tt> members. Clang uses the |
| <tt>ASTContext</tt> class to represent contextual information about the AST, as |
| the name implies, though the most functionally important detail is that several |
| operations require an <tt>ASTContext*</tt> parameter. More immediately useful is |
| the set of matched nodes, and how we retrieve them.</p> |
| |
| <!-- FIXME: Where is this binding described? --> |
| |
| <p>Since we bind three variables (identified by ConditionVarName, |
| InitVarName, and IncrementVarName), we can obtain the matched nodes by using the |
| <tt>getNodeAs()</tt> member function.</p> |
| |
| <p>In <tt>LoopActions.cpp</tt>:</p> |
| <pre class="doc_code"> |
| #include "clang/AST/ASTContext.h" |
| |
| void LoopPrinter::run(const MatchFinder::MatchResult &Result) { |
| ASTContext *Context = Result.Context; |
| const ForStmt *FS = Result.Nodes.getStmtAs<ForStmt>(LoopName); |
| // We do not want to convert header files! |
| if (!FS || !Context->getSourceManager().isFromMainFile(FS->getForLoc())) |
| return; |
| const VarDecl *IncVar = Result.Nodes.getNodeAs<VarDecl>(IncrementVarName); |
| const VarDecl *CondVar = Result.Nodes.getNodeAs<VarDecl>(ConditionVarName); |
| const VarDecl *InitVar = Result.Nodes.getNodeAs<VarDecl>(InitVarName); |
| </pre> |
| |
| <p>Now that we have the three variables, represented by their respective |
| declarations, let's make sure that they're all the same, using a helper function |
| I call <tt>areSameVariable()</tt>.</p> |
| <pre class="doc_code"> |
| if (!areSameVariable(IncVar, CondVar) || !areSameVariable(IncVar, InitVar)) |
| return; |
| llvm::outs() << "Potential array-based loop discovered.\n"; |
| } |
| </pre> |
| |
| <p>If execution reaches the end of <tt>LoopPrinter::run()</tt>, we know that the |
| loop shell that looks like</p> |
| <pre class="doc_code"> |
| for (int i= 0; i < expr(); ++i) { ... } |
| </pre> |
| |
| <p>For now, we will just print a message explaining that we found a loop. The |
| next section will deal with recursively traversing the AST to discover all |
| changes needed.</p> |
| |
| <p>As a side note, here is the implementation of <tt>areSameVariable</tt>. Clang |
| associates a <tt>VarDecl</tt> with each variable to represent the variable's |
| declaration. Since the "canonical" form of each declaration is unique by |
| address, all we need to do is make sure neither <tt>ValueDecl</tt> (base class |
| of <tt>VarDecl</tt>) is <tt>NULL</tt> and compare the canonical Decls.</p> |
| <pre class="doc_code"> |
| static bool areSameVariable(const ValueDecl *First, const ValueDecl *Second) { |
| return First && Second && |
| First->getCanonicalDecl() == Second->getCanonicalDecl(); |
| } |
| </pre> |
| |
| <p>It's not as trivial to test if two expressions are the same, though Clang has |
| already done the hard work for us by providing a way to canonicalize |
| expressions:</p> |
| <pre class="doc_code"> |
| static bool areSameExpr(ASTContext* Context, const Expr *First, |
| const Expr *Second) { |
| if (!First || !Second) |
| return false; |
| llvm::FoldingSetNodeID FirstID, SecondID; |
| First->Profile(FirstID, *Context, true); |
| Second->Profile(SecondID, *Context, true); |
| return FirstID == SecondID; |
| } |
| </pre> |
| |
| <!-- FIXME: Add code example. --> |
| |
| <p>This code relies on the comparison between two |
| <tt>llvm::FoldingSetNodeIDs</tt>. As the documentation for |
| <tt>Stmt::Profile()</tt> indicates, the <tt>Profile()</tt> member function |
| builds a description of a node in the AST, based on its properties, along with |
| those of its children. <tt>FoldingSetNodeID</tt> then serves as a hash we can |
| use to compare expressions. We will need <tt>areSameExpr</tt> later. Before you |
| run the new code on the additional loops added to test-files/simple.cpp, try to |
| figure out which ones will be considered potentially convertible.</p> |
| |
| </div> |
| </body> |
| </html> |