Sean Silva | ee47edf | 2012-12-05 00:26:32 +0000 | [diff] [blame^] | 1 | ====================================================== |
| 2 | Kaleidoscope: Conclusion and other useful LLVM tidbits |
| 3 | ====================================================== |
| 4 | |
| 5 | .. contents:: |
| 6 | :local: |
| 7 | |
| 8 | Written by `Chris Lattner <mailto:sabre@nondot.org>`_ |
| 9 | |
| 10 | Tutorial Conclusion |
| 11 | =================== |
| 12 | |
| 13 | Welcome to the final chapter of the "`Implementing a language with |
| 14 | LLVM <index.html>`_" tutorial. In the course of this tutorial, we have |
| 15 | grown our little Kaleidoscope language from being a useless toy, to |
| 16 | being a semi-interesting (but probably still useless) toy. :) |
| 17 | |
| 18 | It is interesting to see how far we've come, and how little code it has |
| 19 | taken. We built the entire lexer, parser, AST, code generator, and an |
| 20 | interactive run-loop (with a JIT!) by-hand in under 700 lines of |
| 21 | (non-comment/non-blank) code. |
| 22 | |
| 23 | Our little language supports a couple of interesting features: it |
| 24 | supports user defined binary and unary operators, it uses JIT |
| 25 | compilation for immediate evaluation, and it supports a few control flow |
| 26 | constructs with SSA construction. |
| 27 | |
| 28 | Part of the idea of this tutorial was to show you how easy and fun it |
| 29 | can be to define, build, and play with languages. Building a compiler |
| 30 | need not be a scary or mystical process! Now that you've seen some of |
| 31 | the basics, I strongly encourage you to take the code and hack on it. |
| 32 | For example, try adding: |
| 33 | |
| 34 | - **global variables** - While global variables have questional value |
| 35 | in modern software engineering, they are often useful when putting |
| 36 | together quick little hacks like the Kaleidoscope compiler itself. |
| 37 | Fortunately, our current setup makes it very easy to add global |
| 38 | variables: just have value lookup check to see if an unresolved |
| 39 | variable is in the global variable symbol table before rejecting it. |
| 40 | To create a new global variable, make an instance of the LLVM |
| 41 | ``GlobalVariable`` class. |
| 42 | - **typed variables** - Kaleidoscope currently only supports variables |
| 43 | of type double. This gives the language a very nice elegance, because |
| 44 | only supporting one type means that you never have to specify types. |
| 45 | Different languages have different ways of handling this. The easiest |
| 46 | way is to require the user to specify types for every variable |
| 47 | definition, and record the type of the variable in the symbol table |
| 48 | along with its Value\*. |
| 49 | - **arrays, structs, vectors, etc** - Once you add types, you can start |
| 50 | extending the type system in all sorts of interesting ways. Simple |
| 51 | arrays are very easy and are quite useful for many different |
| 52 | applications. Adding them is mostly an exercise in learning how the |
| 53 | LLVM `getelementptr <../LangRef.html#i_getelementptr>`_ instruction |
| 54 | works: it is so nifty/unconventional, it `has its own |
| 55 | FAQ <../GetElementPtr.html>`_! If you add support for recursive types |
| 56 | (e.g. linked lists), make sure to read the `section in the LLVM |
| 57 | Programmer's Manual <../ProgrammersManual.html#TypeResolve>`_ that |
| 58 | describes how to construct them. |
| 59 | - **standard runtime** - Our current language allows the user to access |
| 60 | arbitrary external functions, and we use it for things like "printd" |
| 61 | and "putchard". As you extend the language to add higher-level |
| 62 | constructs, often these constructs make the most sense if they are |
| 63 | lowered to calls into a language-supplied runtime. For example, if |
| 64 | you add hash tables to the language, it would probably make sense to |
| 65 | add the routines to a runtime, instead of inlining them all the way. |
| 66 | - **memory management** - Currently we can only access the stack in |
| 67 | Kaleidoscope. It would also be useful to be able to allocate heap |
| 68 | memory, either with calls to the standard libc malloc/free interface |
| 69 | or with a garbage collector. If you would like to use garbage |
| 70 | collection, note that LLVM fully supports `Accurate Garbage |
| 71 | Collection <../GarbageCollection.html>`_ including algorithms that |
| 72 | move objects and need to scan/update the stack. |
| 73 | - **debugger support** - LLVM supports generation of `DWARF Debug |
| 74 | info <../SourceLevelDebugging.html>`_ which is understood by common |
| 75 | debuggers like GDB. Adding support for debug info is fairly |
| 76 | straightforward. The best way to understand it is to compile some |
| 77 | C/C++ code with "``llvm-gcc -g -O0``" and taking a look at what it |
| 78 | produces. |
| 79 | - **exception handling support** - LLVM supports generation of `zero |
| 80 | cost exceptions <../ExceptionHandling.html>`_ which interoperate with |
| 81 | code compiled in other languages. You could also generate code by |
| 82 | implicitly making every function return an error value and checking |
| 83 | it. You could also make explicit use of setjmp/longjmp. There are |
| 84 | many different ways to go here. |
| 85 | - **object orientation, generics, database access, complex numbers, |
| 86 | geometric programming, ...** - Really, there is no end of crazy |
| 87 | features that you can add to the language. |
| 88 | - **unusual domains** - We've been talking about applying LLVM to a |
| 89 | domain that many people are interested in: building a compiler for a |
| 90 | specific language. However, there are many other domains that can use |
| 91 | compiler technology that are not typically considered. For example, |
| 92 | LLVM has been used to implement OpenGL graphics acceleration, |
| 93 | translate C++ code to ActionScript, and many other cute and clever |
| 94 | things. Maybe you will be the first to JIT compile a regular |
| 95 | expression interpreter into native code with LLVM? |
| 96 | |
| 97 | Have fun - try doing something crazy and unusual. Building a language |
| 98 | like everyone else always has, is much less fun than trying something a |
| 99 | little crazy or off the wall and seeing how it turns out. If you get |
| 100 | stuck or want to talk about it, feel free to email the `llvmdev mailing |
| 101 | list <http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>`_: it has lots |
| 102 | of people who are interested in languages and are often willing to help |
| 103 | out. |
| 104 | |
| 105 | Before we end this tutorial, I want to talk about some "tips and tricks" |
| 106 | for generating LLVM IR. These are some of the more subtle things that |
| 107 | may not be obvious, but are very useful if you want to take advantage of |
| 108 | LLVM's capabilities. |
| 109 | |
| 110 | Properties of the LLVM IR |
| 111 | ========================= |
| 112 | |
| 113 | We have a couple common questions about code in the LLVM IR form - lets |
| 114 | just get these out of the way right now, shall we? |
| 115 | |
| 116 | Target Independence |
| 117 | ------------------- |
| 118 | |
| 119 | Kaleidoscope is an example of a "portable language": any program written |
| 120 | in Kaleidoscope will work the same way on any target that it runs on. |
| 121 | Many other languages have this property, e.g. lisp, java, haskell, |
| 122 | javascript, python, etc (note that while these languages are portable, |
| 123 | not all their libraries are). |
| 124 | |
| 125 | One nice aspect of LLVM is that it is often capable of preserving target |
| 126 | independence in the IR: you can take the LLVM IR for a |
| 127 | Kaleidoscope-compiled program and run it on any target that LLVM |
| 128 | supports, even emitting C code and compiling that on targets that LLVM |
| 129 | doesn't support natively. You can trivially tell that the Kaleidoscope |
| 130 | compiler generates target-independent code because it never queries for |
| 131 | any target-specific information when generating code. |
| 132 | |
| 133 | The fact that LLVM provides a compact, target-independent, |
| 134 | representation for code gets a lot of people excited. Unfortunately, |
| 135 | these people are usually thinking about C or a language from the C |
| 136 | family when they are asking questions about language portability. I say |
| 137 | "unfortunately", because there is really no way to make (fully general) |
| 138 | C code portable, other than shipping the source code around (and of |
| 139 | course, C source code is not actually portable in general either - ever |
| 140 | port a really old application from 32- to 64-bits?). |
| 141 | |
| 142 | The problem with C (again, in its full generality) is that it is heavily |
| 143 | laden with target specific assumptions. As one simple example, the |
| 144 | preprocessor often destructively removes target-independence from the |
| 145 | code when it processes the input text: |
| 146 | |
| 147 | .. code-block:: c |
| 148 | |
| 149 | #ifdef __i386__ |
| 150 | int X = 1; |
| 151 | #else |
| 152 | int X = 42; |
| 153 | #endif |
| 154 | |
| 155 | While it is possible to engineer more and more complex solutions to |
| 156 | problems like this, it cannot be solved in full generality in a way that |
| 157 | is better than shipping the actual source code. |
| 158 | |
| 159 | That said, there are interesting subsets of C that can be made portable. |
| 160 | If you are willing to fix primitive types to a fixed size (say int = |
| 161 | 32-bits, and long = 64-bits), don't care about ABI compatibility with |
| 162 | existing binaries, and are willing to give up some other minor features, |
| 163 | you can have portable code. This can make sense for specialized domains |
| 164 | such as an in-kernel language. |
| 165 | |
| 166 | Safety Guarantees |
| 167 | ----------------- |
| 168 | |
| 169 | Many of the languages above are also "safe" languages: it is impossible |
| 170 | for a program written in Java to corrupt its address space and crash the |
| 171 | process (assuming the JVM has no bugs). Safety is an interesting |
| 172 | property that requires a combination of language design, runtime |
| 173 | support, and often operating system support. |
| 174 | |
| 175 | It is certainly possible to implement a safe language in LLVM, but LLVM |
| 176 | IR does not itself guarantee safety. The LLVM IR allows unsafe pointer |
| 177 | casts, use after free bugs, buffer over-runs, and a variety of other |
| 178 | problems. Safety needs to be implemented as a layer on top of LLVM and, |
| 179 | conveniently, several groups have investigated this. Ask on the `llvmdev |
| 180 | mailing list <http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>`_ if |
| 181 | you are interested in more details. |
| 182 | |
| 183 | Language-Specific Optimizations |
| 184 | ------------------------------- |
| 185 | |
| 186 | One thing about LLVM that turns off many people is that it does not |
| 187 | solve all the world's problems in one system (sorry 'world hunger', |
| 188 | someone else will have to solve you some other day). One specific |
| 189 | complaint is that people perceive LLVM as being incapable of performing |
| 190 | high-level language-specific optimization: LLVM "loses too much |
| 191 | information". |
| 192 | |
| 193 | Unfortunately, this is really not the place to give you a full and |
| 194 | unified version of "Chris Lattner's theory of compiler design". Instead, |
| 195 | I'll make a few observations: |
| 196 | |
| 197 | First, you're right that LLVM does lose information. For example, as of |
| 198 | this writing, there is no way to distinguish in the LLVM IR whether an |
| 199 | SSA-value came from a C "int" or a C "long" on an ILP32 machine (other |
| 200 | than debug info). Both get compiled down to an 'i32' value and the |
| 201 | information about what it came from is lost. The more general issue |
| 202 | here, is that the LLVM type system uses "structural equivalence" instead |
| 203 | of "name equivalence". Another place this surprises people is if you |
| 204 | have two types in a high-level language that have the same structure |
| 205 | (e.g. two different structs that have a single int field): these types |
| 206 | will compile down into a single LLVM type and it will be impossible to |
| 207 | tell what it came from. |
| 208 | |
| 209 | Second, while LLVM does lose information, LLVM is not a fixed target: we |
| 210 | continue to enhance and improve it in many different ways. In addition |
| 211 | to adding new features (LLVM did not always support exceptions or debug |
| 212 | info), we also extend the IR to capture important information for |
| 213 | optimization (e.g. whether an argument is sign or zero extended, |
| 214 | information about pointers aliasing, etc). Many of the enhancements are |
| 215 | user-driven: people want LLVM to include some specific feature, so they |
| 216 | go ahead and extend it. |
| 217 | |
| 218 | Third, it is *possible and easy* to add language-specific optimizations, |
| 219 | and you have a number of choices in how to do it. As one trivial |
| 220 | example, it is easy to add language-specific optimization passes that |
| 221 | "know" things about code compiled for a language. In the case of the C |
| 222 | family, there is an optimization pass that "knows" about the standard C |
| 223 | library functions. If you call "exit(0)" in main(), it knows that it is |
| 224 | safe to optimize that into "return 0;" because C specifies what the |
| 225 | 'exit' function does. |
| 226 | |
| 227 | In addition to simple library knowledge, it is possible to embed a |
| 228 | variety of other language-specific information into the LLVM IR. If you |
| 229 | have a specific need and run into a wall, please bring the topic up on |
| 230 | the llvmdev list. At the very worst, you can always treat LLVM as if it |
| 231 | were a "dumb code generator" and implement the high-level optimizations |
| 232 | you desire in your front-end, on the language-specific AST. |
| 233 | |
| 234 | Tips and Tricks |
| 235 | =============== |
| 236 | |
| 237 | There is a variety of useful tips and tricks that you come to know after |
| 238 | working on/with LLVM that aren't obvious at first glance. Instead of |
| 239 | letting everyone rediscover them, this section talks about some of these |
| 240 | issues. |
| 241 | |
| 242 | Implementing portable offsetof/sizeof |
| 243 | ------------------------------------- |
| 244 | |
| 245 | One interesting thing that comes up, if you are trying to keep the code |
| 246 | generated by your compiler "target independent", is that you often need |
| 247 | to know the size of some LLVM type or the offset of some field in an |
| 248 | llvm structure. For example, you might need to pass the size of a type |
| 249 | into a function that allocates memory. |
| 250 | |
| 251 | Unfortunately, this can vary widely across targets: for example the |
| 252 | width of a pointer is trivially target-specific. However, there is a |
| 253 | `clever way to use the getelementptr |
| 254 | instruction <http://nondot.org/sabre/LLVMNotes/SizeOf-OffsetOf-VariableSizedStructs.txt>`_ |
| 255 | that allows you to compute this in a portable way. |
| 256 | |
| 257 | Garbage Collected Stack Frames |
| 258 | ------------------------------ |
| 259 | |
| 260 | Some languages want to explicitly manage their stack frames, often so |
| 261 | that they are garbage collected or to allow easy implementation of |
| 262 | closures. There are often better ways to implement these features than |
| 263 | explicit stack frames, but `LLVM does support |
| 264 | them, <http://nondot.org/sabre/LLVMNotes/ExplicitlyManagedStackFrames.txt>`_ |
| 265 | if you want. It requires your front-end to convert the code into |
| 266 | `Continuation Passing |
| 267 | Style <http://en.wikipedia.org/wiki/Continuation-passing_style>`_ and |
| 268 | the use of tail calls (which LLVM also supports). |
| 269 | |