|  | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" | 
|  | "http://www.w3.org/TR/html4/strict.dtd"> | 
|  |  | 
|  | <html> | 
|  | <head> | 
|  | <title>Kaleidoscope: Conclusion and other useful LLVM tidbits</title> | 
|  | <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> | 
|  | <meta name="author" content="Chris Lattner"> | 
|  | <link rel="stylesheet" href="../llvm.css" type="text/css"> | 
|  | </head> | 
|  |  | 
|  | <body> | 
|  |  | 
|  | <h1>Kaleidoscope: Conclusion and other useful LLVM tidbits</h1> | 
|  |  | 
|  | <ul> | 
|  | <li><a href="index.html">Up to Tutorial Index</a></li> | 
|  | <li>Chapter 8 | 
|  | <ol> | 
|  | <li><a href="#conclusion">Tutorial Conclusion</a></li> | 
|  | <li><a href="#llvmirproperties">Properties of LLVM IR</a> | 
|  | <ul> | 
|  | <li><a href="#targetindep">Target Independence</a></li> | 
|  | <li><a href="#safety">Safety Guarantees</a></li> | 
|  | <li><a href="#langspecific">Language-Specific Optimizations</a></li> | 
|  | </ul> | 
|  | </li> | 
|  | <li><a href="#tipsandtricks">Tips and Tricks</a> | 
|  | <ul> | 
|  | <li><a href="#offsetofsizeof">Implementing portable | 
|  | offsetof/sizeof</a></li> | 
|  | <li><a href="#gcstack">Garbage Collected Stack Frames</a></li> | 
|  | </ul> | 
|  | </li> | 
|  | </ol> | 
|  | </li> | 
|  | </ul> | 
|  |  | 
|  |  | 
|  | <div class="doc_author"> | 
|  | <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p> | 
|  | </div> | 
|  |  | 
|  | <!-- *********************************************************************** --> | 
|  | <h2><a name="conclusion">Tutorial Conclusion</a></h2> | 
|  | <!-- *********************************************************************** --> | 
|  |  | 
|  | <div> | 
|  |  | 
|  | <p>Welcome to the the final chapter of the "<a href="index.html">Implementing a | 
|  | language with LLVM</a>" tutorial.  In the course of this tutorial, we have grown | 
|  | our little Kaleidoscope language from being a useless toy, to being a | 
|  | semi-interesting (but probably still useless) toy. :)</p> | 
|  |  | 
|  | <p>It is interesting to see how far we've come, and how little code it has | 
|  | taken.  We built the entire lexer, parser, AST, code generator, and an | 
|  | interactive run-loop (with a JIT!) by-hand in under 700 lines of | 
|  | (non-comment/non-blank) code.</p> | 
|  |  | 
|  | <p>Our little language supports a couple of interesting features: it supports | 
|  | user defined binary and unary operators, it uses JIT compilation for immediate | 
|  | evaluation, and it supports a few control flow constructs with SSA construction. | 
|  | </p> | 
|  |  | 
|  | <p>Part of the idea of this tutorial was to show you how easy and fun it can be | 
|  | to define, build, and play with languages.  Building a compiler need not be a | 
|  | scary or mystical process!  Now that you've seen some of the basics, I strongly | 
|  | encourage you to take the code and hack on it.  For example, try adding:</p> | 
|  |  | 
|  | <ul> | 
|  | <li><b>global variables</b> - While global variables have questional value in | 
|  | modern software engineering, they are often useful when putting together quick | 
|  | little hacks like the Kaleidoscope compiler itself.  Fortunately, our current | 
|  | setup makes it very easy to add global variables: just have value lookup check | 
|  | to see if an unresolved variable is in the global variable symbol table before | 
|  | rejecting it.  To create a new global variable, make an instance of the LLVM | 
|  | <tt>GlobalVariable</tt> class.</li> | 
|  |  | 
|  | <li><b>typed variables</b> - Kaleidoscope currently only supports variables of | 
|  | type double.  This gives the language a very nice elegance, because only | 
|  | supporting one type means that you never have to specify types.  Different | 
|  | languages have different ways of handling this.  The easiest way is to require | 
|  | the user to specify types for every variable definition, and record the type | 
|  | of the variable in the symbol table along with its Value*.</li> | 
|  |  | 
|  | <li><b>arrays, structs, vectors, etc</b> - Once you add types, you can start | 
|  | extending the type system in all sorts of interesting ways.  Simple arrays are | 
|  | very easy and are quite useful for many different applications.  Adding them is | 
|  | mostly an exercise in learning how the LLVM <a | 
|  | href="../LangRef.html#i_getelementptr">getelementptr</a> instruction works: it | 
|  | is so nifty/unconventional, it <a | 
|  | href="../GetElementPtr.html">has its own FAQ</a>!  If you add support | 
|  | for recursive types (e.g. linked lists), make sure to read the <a | 
|  | href="../ProgrammersManual.html#TypeResolve">section in the LLVM | 
|  | Programmer's Manual</a> that describes how to construct them.</li> | 
|  |  | 
|  | <li><b>standard runtime</b> - Our current language allows the user to access | 
|  | arbitrary external functions, and we use it for things like "printd" and | 
|  | "putchard".  As you extend the language to add higher-level constructs, often | 
|  | these constructs make the most sense if they are lowered to calls into a | 
|  | language-supplied runtime.  For example, if you add hash tables to the language, | 
|  | it would probably make sense to add the routines to a runtime, instead of | 
|  | inlining them all the way.</li> | 
|  |  | 
|  | <li><b>memory management</b> - Currently we can only access the stack in | 
|  | Kaleidoscope.  It would also be useful to be able to allocate heap memory, | 
|  | either with calls to the standard libc malloc/free interface or with a garbage | 
|  | collector.  If you would like to use garbage collection, note that LLVM fully | 
|  | supports <a href="../GarbageCollection.html">Accurate Garbage Collection</a> | 
|  | including algorithms that move objects and need to scan/update the stack.</li> | 
|  |  | 
|  | <li><b>debugger support</b> - LLVM supports generation of <a | 
|  | href="../SourceLevelDebugging.html">DWARF Debug info</a> which is understood by | 
|  | common debuggers like GDB.  Adding support for debug info is fairly | 
|  | straightforward.  The best way to understand it is to compile some C/C++ code | 
|  | with "<tt>llvm-gcc -g -O0</tt>" and taking a look at what it produces.</li> | 
|  |  | 
|  | <li><b>exception handling support</b> - LLVM supports generation of <a | 
|  | href="../ExceptionHandling.html">zero cost exceptions</a> which interoperate | 
|  | with code compiled in other languages.  You could also generate code by | 
|  | implicitly making every function return an error value and checking it.  You | 
|  | could also make explicit use of setjmp/longjmp.  There are many different ways | 
|  | to go here.</li> | 
|  |  | 
|  | <li><b>object orientation, generics, database access, complex numbers, | 
|  | geometric programming, ...</b> - Really, there is | 
|  | no end of crazy features that you can add to the language.</li> | 
|  |  | 
|  | <li><b>unusual domains</b> - We've been talking about applying LLVM to a domain | 
|  | that many people are interested in: building a compiler for a specific language. | 
|  | However, there are many other domains that can use compiler technology that are | 
|  | not typically considered.  For example, LLVM has been used to implement OpenGL | 
|  | graphics acceleration, translate C++ code to ActionScript, and many other | 
|  | cute and clever things.  Maybe you will be the first to JIT compile a regular | 
|  | expression interpreter into native code with LLVM?</li> | 
|  |  | 
|  | </ul> | 
|  |  | 
|  | <p> | 
|  | Have fun - try doing something crazy and unusual.  Building a language like | 
|  | everyone else always has, is much less fun than trying something a little crazy | 
|  | or off the wall and seeing how it turns out.  If you get stuck or want to talk | 
|  | about it, feel free to email the <a | 
|  | href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev">llvmdev mailing | 
|  | list</a>: it has lots of people who are interested in languages and are often | 
|  | willing to help out. | 
|  | </p> | 
|  |  | 
|  | <p>Before we end this tutorial, I want to talk about some "tips and tricks" for generating | 
|  | LLVM IR.  These are some of the more subtle things that may not be obvious, but | 
|  | are very useful if you want to take advantage of LLVM's capabilities.</p> | 
|  |  | 
|  | </div> | 
|  |  | 
|  | <!-- *********************************************************************** --> | 
|  | <h2><a name="llvmirproperties">Properties of the LLVM IR</a></h2> | 
|  | <!-- *********************************************************************** --> | 
|  |  | 
|  | <div> | 
|  |  | 
|  | <p>We have a couple common questions about code in the LLVM IR form - lets just | 
|  | get these out of the way right now, shall we?</p> | 
|  |  | 
|  | <!-- ======================================================================= --> | 
|  | <h4><a name="targetindep">Target Independence</a></h4> | 
|  | <!-- ======================================================================= --> | 
|  |  | 
|  | <div> | 
|  |  | 
|  | <p>Kaleidoscope is an example of a "portable language": any program written in | 
|  | Kaleidoscope will work the same way on any target that it runs on.  Many other | 
|  | languages have this property, e.g. lisp, java, haskell, javascript, python, etc | 
|  | (note that while these languages are portable, not all their libraries are).</p> | 
|  |  | 
|  | <p>One nice aspect of LLVM is that it is often capable of preserving target | 
|  | independence in the IR: you can take the LLVM IR for a Kaleidoscope-compiled | 
|  | program and run it on any target that LLVM supports, even emitting C code and | 
|  | compiling that on targets that LLVM doesn't support natively.  You can trivially | 
|  | tell that the Kaleidoscope compiler generates target-independent code because it | 
|  | never queries for any target-specific information when generating code.</p> | 
|  |  | 
|  | <p>The fact that LLVM provides a compact, target-independent, representation for | 
|  | code gets a lot of people excited.  Unfortunately, these people are usually | 
|  | thinking about C or a language from the C family when they are asking questions | 
|  | about language portability.  I say "unfortunately", because there is really no | 
|  | way to make (fully general) C code portable, other than shipping the source code | 
|  | around (and of course, C source code is not actually portable in general | 
|  | either - ever port a really old application from 32- to 64-bits?).</p> | 
|  |  | 
|  | <p>The problem with C (again, in its full generality) is that it is heavily | 
|  | laden with target specific assumptions.  As one simple example, the preprocessor | 
|  | often destructively removes target-independence from the code when it processes | 
|  | the input text:</p> | 
|  |  | 
|  | <div class="doc_code"> | 
|  | <pre> | 
|  | #ifdef __i386__ | 
|  | int X = 1; | 
|  | #else | 
|  | int X = 42; | 
|  | #endif | 
|  | </pre> | 
|  | </div> | 
|  |  | 
|  | <p>While it is possible to engineer more and more complex solutions to problems | 
|  | like this, it cannot be solved in full generality in a way that is better than shipping | 
|  | the actual source code.</p> | 
|  |  | 
|  | <p>That said, there are interesting subsets of C that can be made portable.  If | 
|  | you are willing to fix primitive types to a fixed size (say int = 32-bits, | 
|  | and long = 64-bits), don't care about ABI compatibility with existing binaries, | 
|  | and are willing to give up some other minor features, you can have portable | 
|  | code.  This can make sense for specialized domains such as an | 
|  | in-kernel language.</p> | 
|  |  | 
|  | </div> | 
|  |  | 
|  | <!-- ======================================================================= --> | 
|  | <h4><a name="safety">Safety Guarantees</a></h4> | 
|  | <!-- ======================================================================= --> | 
|  |  | 
|  | <div> | 
|  |  | 
|  | <p>Many of the languages above are also "safe" languages: it is impossible for | 
|  | a program written in Java to corrupt its address space and crash the process | 
|  | (assuming the JVM has no bugs). | 
|  | Safety is an interesting property that requires a combination of language | 
|  | design, runtime support, and often operating system support.</p> | 
|  |  | 
|  | <p>It is certainly possible to implement a safe language in LLVM, but LLVM IR | 
|  | does not itself guarantee safety.  The LLVM IR allows unsafe pointer casts, | 
|  | use after free bugs, buffer over-runs, and a variety of other problems.  Safety | 
|  | needs to be implemented as a layer on top of LLVM and, conveniently, several | 
|  | groups have investigated this.  Ask on the <a | 
|  | href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev">llvmdev mailing | 
|  | list</a> if you are interested in more details.</p> | 
|  |  | 
|  | </div> | 
|  |  | 
|  | <!-- ======================================================================= --> | 
|  | <h4><a name="langspecific">Language-Specific Optimizations</a></h4> | 
|  | <!-- ======================================================================= --> | 
|  |  | 
|  | <div> | 
|  |  | 
|  | <p>One thing about LLVM that turns off many people is that it does not solve all | 
|  | the world's problems in one system (sorry 'world hunger', someone else will have | 
|  | to solve you some other day).  One specific complaint is that people perceive | 
|  | LLVM as being incapable of performing high-level language-specific optimization: | 
|  | LLVM "loses too much information".</p> | 
|  |  | 
|  | <p>Unfortunately, this is really not the place to give you a full and unified | 
|  | version of "Chris Lattner's theory of compiler design".  Instead, I'll make a | 
|  | few observations:</p> | 
|  |  | 
|  | <p>First, you're right that LLVM does lose information.  For example, as of this | 
|  | writing, there is no way to distinguish in the LLVM IR whether an SSA-value came | 
|  | from a C "int" or a C "long" on an ILP32 machine (other than debug info).  Both | 
|  | get compiled down to an 'i32' value and the information about what it came from | 
|  | is lost.  The more general issue here, is that the LLVM type system uses | 
|  | "structural equivalence" instead of "name equivalence".  Another place this | 
|  | surprises people is if you have two types in a high-level language that have the | 
|  | same structure (e.g. two different structs that have a single int field): these | 
|  | types will compile down into a single LLVM type and it will be impossible to | 
|  | tell what it came from.</p> | 
|  |  | 
|  | <p>Second, while LLVM does lose information, LLVM is not a fixed target: we | 
|  | continue to enhance and improve it in many different ways.  In addition to | 
|  | adding new features (LLVM did not always support exceptions or debug info), we | 
|  | also extend the IR to capture important information for optimization (e.g. | 
|  | whether an argument is sign or zero extended, information about pointers | 
|  | aliasing, etc).  Many of the enhancements are user-driven: people want LLVM to | 
|  | include some specific feature, so they go ahead and extend it.</p> | 
|  |  | 
|  | <p>Third, it is <em>possible and easy</em> to add language-specific | 
|  | optimizations, and you have a number of choices in how to do it.  As one trivial | 
|  | example, it is easy to add language-specific optimization passes that | 
|  | "know" things about code compiled for a language.  In the case of the C family, | 
|  | there is an optimization pass that "knows" about the standard C library | 
|  | functions.  If you call "exit(0)" in main(), it knows that it is safe to | 
|  | optimize that into "return 0;" because C specifies what the 'exit' | 
|  | function does.</p> | 
|  |  | 
|  | <p>In addition to simple library knowledge, it is possible to embed a variety of | 
|  | other language-specific information into the LLVM IR.  If you have a specific | 
|  | need and run into a wall, please bring the topic up on the llvmdev list.  At the | 
|  | very worst, you can always treat LLVM as if it were a "dumb code generator" and | 
|  | implement the high-level optimizations you desire in your front-end, on the | 
|  | language-specific AST. | 
|  | </p> | 
|  |  | 
|  | </div> | 
|  |  | 
|  | </div> | 
|  |  | 
|  | <!-- *********************************************************************** --> | 
|  | <h2><a name="tipsandtricks">Tips and Tricks</a></h2> | 
|  | <!-- *********************************************************************** --> | 
|  |  | 
|  | <div> | 
|  |  | 
|  | <p>There is a variety of useful tips and tricks that you come to know after | 
|  | working on/with LLVM that aren't obvious at first glance.  Instead of letting | 
|  | everyone rediscover them, this section talks about some of these issues.</p> | 
|  |  | 
|  | <!-- ======================================================================= --> | 
|  | <h4><a name="offsetofsizeof">Implementing portable offsetof/sizeof</a></h4> | 
|  | <!-- ======================================================================= --> | 
|  |  | 
|  | <div> | 
|  |  | 
|  | <p>One interesting thing that comes up, if you are trying to keep the code | 
|  | generated by your compiler "target independent", is that you often need to know | 
|  | the size of some LLVM type or the offset of some field in an llvm structure. | 
|  | For example, you might need to pass the size of a type into a function that | 
|  | allocates memory.</p> | 
|  |  | 
|  | <p>Unfortunately, this can vary widely across targets: for example the width of | 
|  | a pointer is trivially target-specific.  However, there is a <a | 
|  | href="http://nondot.org/sabre/LLVMNotes/SizeOf-OffsetOf-VariableSizedStructs.txt">clever | 
|  | way to use the getelementptr instruction</a> that allows you to compute this | 
|  | in a portable way.</p> | 
|  |  | 
|  | </div> | 
|  |  | 
|  | <!-- ======================================================================= --> | 
|  | <h4><a name="gcstack">Garbage Collected Stack Frames</a></h4> | 
|  | <!-- ======================================================================= --> | 
|  |  | 
|  | <div> | 
|  |  | 
|  | <p>Some languages want to explicitly manage their stack frames, often so that | 
|  | they are garbage collected or to allow easy implementation of closures.  There | 
|  | are often better ways to implement these features than explicit stack frames, | 
|  | but <a | 
|  | href="http://nondot.org/sabre/LLVMNotes/ExplicitlyManagedStackFrames.txt">LLVM | 
|  | does support them,</a> if you want.  It requires your front-end to convert the | 
|  | code into <a | 
|  | href="http://en.wikipedia.org/wiki/Continuation-passing_style">Continuation | 
|  | Passing Style</a> and the use of tail calls (which LLVM also supports).</p> | 
|  |  | 
|  | </div> | 
|  |  | 
|  | </div> | 
|  |  | 
|  | <!-- *********************************************************************** --> | 
|  | <hr> | 
|  | <address> | 
|  | <a href="http://jigsaw.w3.org/css-validator/check/referer"><img | 
|  | src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a> | 
|  | <a href="http://validator.w3.org/check/referer"><img | 
|  | src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a> | 
|  |  | 
|  | <a href="mailto:sabre@nondot.org">Chris Lattner</a><br> | 
|  | <a href="http://llvm.org/">The LLVM Compiler Infrastructure</a><br> | 
|  | Last modified: $Date$ | 
|  | </address> | 
|  | </body> | 
|  | </html> |