Chris Lattner | b8fc650 | 2007-11-05 01:58:13 +0000 | [diff] [blame] | 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" |
| 2 | "http://www.w3.org/TR/html4/strict.dtd"> |
| 3 | |
| 4 | <html> |
| 5 | <head> |
Chris Lattner | 99005a4 | 2007-11-05 19:10:15 +0000 | [diff] [blame] | 6 | <title>Kaleidoscope: Conclusion and other useful LLVM tidbits</title> |
Chris Lattner | b8fc650 | 2007-11-05 01:58:13 +0000 | [diff] [blame] | 7 | <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> |
| 8 | <meta name="author" content="Chris Lattner"> |
| 9 | <link rel="stylesheet" href="../llvm.css" type="text/css"> |
| 10 | </head> |
| 11 | |
| 12 | <body> |
| 13 | |
Chris Lattner | 99005a4 | 2007-11-05 19:10:15 +0000 | [diff] [blame] | 14 | <div class="doc_title">Kaleidoscope: Conclusion and other useful LLVM |
| 15 | tidbits</div> |
| 16 | |
| 17 | <ul> |
Chris Lattner | 0e555b1 | 2007-11-05 20:04:56 +0000 | [diff] [blame] | 18 | <li><a href="index.html">Up to Tutorial Index</a></li> |
Chris Lattner | 99005a4 | 2007-11-05 19:10:15 +0000 | [diff] [blame] | 19 | <li>Chapter 8 |
| 20 | <ol> |
| 21 | <li><a href="#conclusion">Tutorial Conclusion</a></li> |
| 22 | <li><a href="#llvmirproperties">Properties of LLVM IR</a> |
| 23 | <ul> |
| 24 | <li><a href="#targetindep">Target Independence</a></li> |
| 25 | <li><a href="#safety">Safety Guarantees</a></li> |
| 26 | <li><a href="#langspecific">Language-Specific Optimizations</a></li> |
| 27 | </ul> |
| 28 | </li> |
| 29 | <li><a href="#tipsandtricks">Tips and Tricks</a> |
| 30 | <ul> |
| 31 | <li><a href="#offsetofsizeof">Implementing portable |
| 32 | offsetof/sizeof</a></li> |
| 33 | <li><a href="#gcstack">Garbage Collected Stack Frames</a></li> |
| 34 | </ul> |
| 35 | </li> |
| 36 | </ol> |
| 37 | </li> |
| 38 | </ul> |
| 39 | |
Chris Lattner | b8fc650 | 2007-11-05 01:58:13 +0000 | [diff] [blame] | 40 | |
| 41 | <div class="doc_author"> |
| 42 | <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p> |
| 43 | </div> |
| 44 | |
| 45 | <!-- *********************************************************************** --> |
Chris Lattner | 99005a4 | 2007-11-05 19:10:15 +0000 | [diff] [blame] | 46 | <div class="doc_section"><a name="conclusion">Tutorial Conclusion</a></div> |
Chris Lattner | b8fc650 | 2007-11-05 01:58:13 +0000 | [diff] [blame] | 47 | <!-- *********************************************************************** --> |
| 48 | |
| 49 | <div class="doc_text"> |
| 50 | |
| 51 | <p>Welcome to the the final chapter of the "<a href="index.html">Implementing a |
| 52 | language with LLVM</a>" tutorial. In the course of this tutorial, we have grown |
| 53 | our little Kaleidoscope language from being a useless toy, to being a |
| 54 | semi-interesting (but probably still useless) toy. :)</p> |
| 55 | |
| 56 | <p>It is interesting to see how far we've come, and how little code it has |
| 57 | taken. We built the entire lexer, parser, AST, code generator, and an |
| 58 | interactive run-loop (with a JIT!) by-hand in under 700 lines of |
| 59 | (non-comment/non-blank) code.</p> |
| 60 | |
| 61 | <p>Our little language supports a couple of interesting features: it supports |
| 62 | user defined binary and unary operators, it uses JIT compilation for immediate |
| 63 | evaluation, and it supports a few control flow constructs with SSA construction. |
| 64 | </p> |
| 65 | |
| 66 | <p>Part of the idea of this tutorial was to show you how easy and fun it can be |
| 67 | to define, build, and play with languages. Building a compiler need not be a |
| 68 | scary or mystical process! Now that you've seen some of the basics, I strongly |
| 69 | encourage you to take the code and hack on it. For example, try adding:</p> |
| 70 | |
| 71 | <ul> |
| 72 | <li><b>global variables</b> - While global variables have questional value in |
| 73 | modern software engineering, they are often useful when putting together quick |
| 74 | little hacks like the Kaleidoscope compiler itself. Fortunately, our current |
| 75 | setup makes it very easy to add global variables: just have value lookup check |
| 76 | to see if an unresolved variable is in the global variable symbol table before |
| 77 | rejecting it. To create a new global variable, make an instance of the LLVM |
| 78 | <tt>GlobalVariable</tt> class.</li> |
| 79 | |
| 80 | <li><b>typed variables</b> - Kaleidoscope currently only supports variables of |
| 81 | type double. This gives the language a very nice elegance, because only |
| 82 | supporting one type means that you never have to specify types. Different |
| 83 | languages have different ways of handling this. The easiest way is to require |
| 84 | the user to specify types for every variable definition, and record the type |
| 85 | of the variable in the symbol table along with its Value*.</li> |
| 86 | |
| 87 | <li><b>arrays, structs, vectors, etc</b> - Once you add types, you can start |
| 88 | extending the type system in all sorts of interesting ways. Simple arrays are |
| 89 | very easy and are quite useful for many different applications. Adding them is |
| 90 | mostly an exercise in learning how the LLVM <a |
| 91 | href="../LangRef.html#i_getelementptr">getelementptr</a> instruction works. |
| 92 | The getelementptr instruction is so nifty/unconventional, it <a |
Chris Lattner | 9ca08f3 | 2007-11-05 19:28:07 +0000 | [diff] [blame] | 93 | href="../GetElementPtr.html">has its own FAQ</a>!). If you add support |
| 94 | for recursive types (e.g. linked lists), make sure to read the <a |
| 95 | href="../ProgrammersManual.html#TypeResolve">section in the LLVM |
| 96 | Programmer's Manual</a> that describes how to construct them.</li> |
Chris Lattner | b8fc650 | 2007-11-05 01:58:13 +0000 | [diff] [blame] | 97 | |
| 98 | <li><b>standard runtime</b> - Our current language allows the user to access |
| 99 | arbitrary external functions, and we use it for things like "printd" and |
| 100 | "putchard". As you extend the language to add higher-level constructs, often |
| 101 | these constructs make the most amount of sense to be lowered into calls into a |
| 102 | language-supplied runtime. For example, if you add hash tables to the language, |
| 103 | it would probably make sense to add the routines to a runtime, instead of |
| 104 | inlining them all the way.</li> |
| 105 | |
| 106 | <li><b>memory management</b> - Currently we can only access the stack in |
| 107 | Kaleidoscope. It would also be useful to be able to allocate heap memory, |
| 108 | either with calls to the standard libc malloc/free interface or with a garbage |
| 109 | collector. If you choose to use garbage collection, note that LLVM fully |
| 110 | supports <a href="../GarbageCollection.html">Accurate Garbage Collection</a> |
| 111 | including algorithms that move objects and need to scan/update the stack.</li> |
| 112 | |
| 113 | <li><b>debugger support</b> - LLVM supports generation of <a |
| 114 | href="../SourceLevelDebugging.html">DWARF Debug info</a> which is understood by |
| 115 | common debuggers like GDB. Adding support for debug info is fairly |
| 116 | straight-forward. The best way to understand it is to compile some C/C++ code |
| 117 | with "<tt>llvm-gcc -g -O0</tt>" and taking a look at what it produces.</li> |
| 118 | |
Chris Lattner | a3f07ef | 2007-11-05 07:00:54 +0000 | [diff] [blame] | 119 | <li><b>exception handling support</b> - LLVM supports generation of <a |
Chris Lattner | b8fc650 | 2007-11-05 01:58:13 +0000 | [diff] [blame] | 120 | href="../ExceptionHandling.html">zero cost exceptions</a> which interoperate |
| 121 | with code compiled in other languages. You could also generate code by |
| 122 | implicitly making every function return an error value and checking it. You |
| 123 | could also make explicit use of setjmp/longjmp. There are many different ways |
| 124 | to go here.</li> |
| 125 | |
| 126 | <li><b>object orientation, generics, database access, complex numbers, |
| 127 | geometric programming, ...</b> - Really, there is |
| 128 | no end of crazy features that you can add to the language.</li> |
| 129 | |
Chris Lattner | a3f07ef | 2007-11-05 07:00:54 +0000 | [diff] [blame] | 130 | <li><b>unusual domains</b> - We've been talking about applying LLVM to a domain |
| 131 | that many people are interested in: building a compiler for a specific language. |
| 132 | However, there are many other domains that can use compiler technology that are |
| 133 | not typically considered. For example, LLVM has been used to implement OpenGL |
| 134 | graphics acceleration, translate C++ code to ActionScript, and many other |
| 135 | cute and clever things. Maybe you will be the first to JIT compile a regular |
| 136 | expression interpreter into native code with LLVM?</li> |
| 137 | |
Chris Lattner | b8fc650 | 2007-11-05 01:58:13 +0000 | [diff] [blame] | 138 | </ul> |
| 139 | |
| 140 | <p> |
| 141 | Have fun - try doing something crazy and unusual. Building a language like |
| 142 | everyone else always has is much less fun than trying something a little crazy |
| 143 | and off the wall and seeing how it turns out. If you get stuck or want to talk |
| 144 | about it, feel free to email the <a |
| 145 | href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev">llvmdev mailing |
| 146 | list</a>: it has lots of people who are interested in languages and are often |
| 147 | willing to help out. |
| 148 | </p> |
| 149 | |
| 150 | <p>Before we end, I want to talk about some "tips and tricks" for generating |
| 151 | LLVM IR. These are some of the more subtle things that may not be obvious, but |
| 152 | are very useful if you want to take advantage of LLVM's capabilities.</p> |
| 153 | |
| 154 | </div> |
| 155 | |
| 156 | <!-- *********************************************************************** --> |
Chris Lattner | a3f07ef | 2007-11-05 07:00:54 +0000 | [diff] [blame] | 157 | <div class="doc_section"><a name="llvmirproperties">Properties of LLVM |
| 158 | IR</a></div> |
| 159 | <!-- *********************************************************************** --> |
| 160 | |
| 161 | <div class="doc_text"> |
| 162 | |
| 163 | <p>We have a couple common questions about code in the LLVM IR form, lets just |
| 164 | get these out of the way right now shall we?</p> |
| 165 | |
| 166 | </div> |
| 167 | |
| 168 | <!-- ======================================================================= --> |
| 169 | <div class="doc_subsubsection"><a name="targetindep">Target |
| 170 | Independence</a></div> |
| 171 | <!-- ======================================================================= --> |
| 172 | |
| 173 | <div class="doc_text"> |
| 174 | |
| 175 | <p>Kaleidoscope is an example of a "portable language": any program written in |
| 176 | Kaleidoscope will work the same way on any target that it runs on. Many other |
| 177 | languages have this property, e.g. lisp, java, haskell, javascript, python, etc |
| 178 | (note that while these languages are portable, not all their libraries are).</p> |
| 179 | |
| 180 | <p>One nice aspect of LLVM is that it is often capable of preserving language |
| 181 | independence in the IR: you can take the LLVM IR for a Kaleidoscope-compiled |
| 182 | program and run it on any target that LLVM supports, even emitting C code and |
| 183 | compiling that on targets that LLVM doesn't support natively. You can trivially |
| 184 | tell that the Kaleidoscope compiler generates target-independent code because it |
| 185 | never queries for any target-specific information when generating code.</p> |
| 186 | |
| 187 | <p>The fact that LLVM provides a compact target-independent representation for |
| 188 | code gets a lot of people excited. Unfortunately, these people are usually |
| 189 | thinking about C or a language from the C family when they are asking questions |
| 190 | about language portability. I say "unfortunately", because there is really no |
| 191 | way to make (fully general) C code portable, other than shipping the source code |
| 192 | around (and of course, C source code is not actually portable in general |
| 193 | either - ever port a really old application from 32- to 64-bits?).</p> |
| 194 | |
| 195 | <p>The problem with C (again, in its full generality) is that it is heavily |
| 196 | laden with target specific assumptions. As one simple example, the preprocessor |
| 197 | often destructively removes target-independence from the code when it processes |
| 198 | the input text:</p> |
| 199 | |
| 200 | <div class="doc_code"> |
| 201 | <pre> |
| 202 | #ifdef __i386__ |
| 203 | int X = 1; |
| 204 | #else |
| 205 | int X = 42; |
| 206 | #endif |
| 207 | </pre> |
| 208 | </div> |
| 209 | |
| 210 | <p>While it is possible to engineer more and more complex solutions to problems |
| 211 | like this, it cannot be solved in full generality in a way better than shipping |
| 212 | the actual source code.</p> |
| 213 | |
| 214 | <p>That said, there are interesting subsets of C that can be made portable. If |
| 215 | you are willing to fix primitive types to a fixed size (say int = 32-bits, |
| 216 | and long = 64-bits), don't care about ABI compatibility with existing binaries, |
| 217 | and are willing to give up some other minor features, you can have portable |
| 218 | code. This can even make real sense for specialized domains such as an |
| 219 | in-kernel language.</p> |
| 220 | |
| 221 | </div> |
| 222 | |
| 223 | <!-- ======================================================================= --> |
| 224 | <div class="doc_subsubsection"><a name="safety">Safety Guarantees</a></div> |
| 225 | <!-- ======================================================================= --> |
| 226 | |
| 227 | <div class="doc_text"> |
| 228 | |
| 229 | <p>Many of the languages above are also "safe" languages: it is impossible for |
| 230 | a program written in Java to corrupt its address space and crash the process. |
| 231 | Safety is an interesting property that requires a combination of language |
| 232 | design, runtime support, and often operating system support.</p> |
| 233 | |
| 234 | <p>It is certainly possible to implement a safe language in LLVM, but LLVM IR |
| 235 | does not itself guarantee safety. The LLVM IR allows unsafe pointer casts, |
| 236 | use after free bugs, buffer over-runs, and a variety of other problems. Safety |
| 237 | needs to be implemented as a layer on top of LLVM and, conveniently, several |
| 238 | groups have investigated this. Ask on the <a |
| 239 | href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev">llvmdev mailing |
| 240 | list</a> if you are interested in more details.</p> |
| 241 | |
| 242 | </div> |
| 243 | |
| 244 | <!-- ======================================================================= --> |
| 245 | <div class="doc_subsubsection"><a name="langspecific">Language-Specific |
| 246 | Optimizations</a></div> |
| 247 | <!-- ======================================================================= --> |
| 248 | |
| 249 | <div class="doc_text"> |
| 250 | |
| 251 | <p>One thing about LLVM that turns off many people is that it does not solve all |
| 252 | the world's problems in one system (sorry 'world hunger', someone else will have |
| 253 | to solve you some other day). One specific complaint is that people perceive |
| 254 | LLVM as being incapable of performing high-level language-specific optimization: |
| 255 | LLVM "loses too much information".</p> |
| 256 | |
| 257 | <p>Unfortunately, this is really not the place to give you a full and unified |
| 258 | version of "Chris Lattner's theory of compiler design". Instead, I'll make a |
| 259 | few observations:</p> |
| 260 | |
| 261 | <p>First, you're right that LLVM does lose information. For example, as of this |
| 262 | writing, there is no way to distinguish in the LLVM IR whether an SSA-value came |
| 263 | from a C "int" or a C "long" on an ILP32 machine (other than debug info). Both |
| 264 | get compiled down to an 'i32' value and the information about what it came from |
| 265 | is lost. The more general issue here is that the LLVM type system uses |
| 266 | "structural equivalence" instead of "name equivalence". Another place this |
| 267 | surprises people is if you have two types in a high-level language that have the |
| 268 | same structure (e.g. two different structs that have a single int field): these |
| 269 | types will compile down into a single LLVM type and it will be impossible to |
| 270 | tell what it came from.</p> |
| 271 | |
| 272 | <p>Second, while LLVM does lose information, LLVM is not a fixed target: we |
| 273 | continue to enhance and improve it in many different ways. In addition to |
| 274 | adding new features (LLVM did not always support exceptions or debug info), we |
| 275 | also extend the IR to capture important information for optimization (e.g. |
| 276 | whether an argument is sign or zero extended, information about pointers |
| 277 | aliasing, etc. Many of the enhancements are user-driven: people want LLVM to |
| 278 | do some specific feature, so they go ahead and extend it to do so.</p> |
| 279 | |
| 280 | <p>Third, it <em>is certainly possible</em> to add language-specific |
| 281 | optimizations, and you have a number of choices in how to do it. As one trivial |
| 282 | example, it is possible to add language-specific optimization passes that |
Chris Lattner | efcc3f5 | 2007-11-06 01:40:41 +0000 | [diff] [blame^] | 283 | "know" things about code compiled for a language. In the case of the C family, |
Chris Lattner | a3f07ef | 2007-11-05 07:00:54 +0000 | [diff] [blame] | 284 | there is an optimziation pass that "knows" about the standard C library |
| 285 | functions. If you call "exit(0)" in main(), it knows that it is safe to |
| 286 | optimize that into "return 0;" for example, because C specifies what the 'exit' |
| 287 | function does.</p> |
| 288 | |
| 289 | <p>In addition to simple library knowledge, it is possible to embed a variety of |
| 290 | other language-specific information into the LLVM IR. If you have a specific |
| 291 | need and run into a wall, please bring the topic up on the llvmdev list. At the |
| 292 | very worst, you can always treat LLVM as if it were a "dumb code generator" and |
| 293 | implement the high-level optimizations you desire in your front-end on the |
| 294 | language-specific AST. |
| 295 | </p> |
| 296 | |
| 297 | </div> |
| 298 | |
| 299 | <!-- *********************************************************************** --> |
Chris Lattner | b8fc650 | 2007-11-05 01:58:13 +0000 | [diff] [blame] | 300 | <div class="doc_section"><a name="tipsandtricks">Tips and Tricks</a></div> |
| 301 | <!-- *********************************************************************** --> |
| 302 | |
| 303 | <div class="doc_text"> |
| 304 | |
Chris Lattner | a3f07ef | 2007-11-05 07:00:54 +0000 | [diff] [blame] | 305 | <p>There is a variety of useful tips and tricks that you come to know after |
| 306 | working on/with LLVM that aren't obvious at first glance. Instead of letting |
| 307 | everyone rediscover them, this section talks about some of these issues.</p> |
| 308 | |
| 309 | </div> |
| 310 | |
| 311 | <!-- ======================================================================= --> |
| 312 | <div class="doc_subsubsection"><a name="offsetofsizeof">Implementing portable |
| 313 | offsetof/sizeof</a></div> |
| 314 | <!-- ======================================================================= --> |
| 315 | |
| 316 | <div class="doc_text"> |
| 317 | |
| 318 | <p>One interesting thing that comes up if you are trying to keep the code |
| 319 | generated by your compiler "target independent" is that you often need to know |
| 320 | the size of some LLVM type or the offset of some field in an llvm structure. |
| 321 | For example, you might need to pass the size of a type into a function that |
| 322 | allocates memory.</p> |
| 323 | |
| 324 | <p>Unfortunately, this can vary widely across targets: for example the width of |
| 325 | a pointer is trivially target-specific. However, there is a <a |
| 326 | href="http://nondot.org/sabre/LLVMNotes/SizeOf-OffsetOf-VariableSizedStructs.txt">clever |
| 327 | way to use the getelementptr instruction</a> that allows you to compute this |
| 328 | in a portable way.</p> |
| 329 | |
| 330 | </div> |
| 331 | |
| 332 | <!-- ======================================================================= --> |
| 333 | <div class="doc_subsubsection"><a name="gcstack">Garbage Collected |
| 334 | Stack Frames</a></div> |
| 335 | <!-- ======================================================================= --> |
| 336 | |
| 337 | <div class="doc_text"> |
| 338 | |
| 339 | <p>Some languages want to explicitly manage their stack frames, often so that |
| 340 | they are garbage collected or to allow easy implementation of closures. There |
| 341 | are often better ways to implement these features than explicit stack frames, |
| 342 | but <a |
| 343 | href="http://nondot.org/sabre/LLVMNotes/ExplicitlyManagedStackFrames.txt">LLVM |
| 344 | does support them if you want</a>. It requires your front-end to convert the |
| 345 | code into <a |
| 346 | href="http://en.wikipedia.org/wiki/Continuation-passing_style">Continuation |
| 347 | Passing Style</a> and use of tail calls (which LLVM also supports).</p> |
Chris Lattner | b8fc650 | 2007-11-05 01:58:13 +0000 | [diff] [blame] | 348 | |
| 349 | </div> |
| 350 | |
| 351 | <!-- *********************************************************************** --> |
| 352 | <hr> |
| 353 | <address> |
| 354 | <a href="http://jigsaw.w3.org/css-validator/check/referer"><img |
| 355 | src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a> |
| 356 | <a href="http://validator.w3.org/check/referer"><img |
| 357 | src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a> |
| 358 | |
| 359 | <a href="mailto:sabre@nondot.org">Chris Lattner</a><br> |
| 360 | <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br> |
| 361 | Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $ |
| 362 | </address> |
| 363 | </body> |
| 364 | </html> |