<!-- sewardj | de4a1d0 | 2002-03-22 01:27:54 +0000 -->
<html>
<head>
<style type="text/css">
body      { background-color: #ffffff;
            color: #000000;
            font-family: Times, Helvetica, Arial;
            font-size: 14pt}
h4        { margin-bottom: 0.3em}
code      { color: #000000;
            font-family: Courier;
            font-size: 13pt }
pre       { color: #000000;
            font-family: Courier;
            font-size: 13pt }
a:link    { color: #0000C0;
            text-decoration: none; }
a:visited { color: #0000C0;
            text-decoration: none; }
a:active  { color: #0000C0;
            text-decoration: none; }
</style>
<title>The design and implementation of Valgrind</title>
</head>

<body bgcolor="#ffffff">

<a name="title"> </a>
<h1 align=center>The design and implementation of Valgrind</h1>

<center>
Detailed technical notes for hackers, maintainers and the
overly-curious<br>
These notes pertain to snapshot 20020306<br>
<p>
<a href="mailto:jseward@acm.org">jseward@acm.org</a><br>
<a href="http://developer.kde.org/~sewardj">http://developer.kde.org/~sewardj</a><br>
<a href="http://www.muraroa.demon.co.uk">http://www.muraroa.demon.co.uk</a><br>
Copyright &copy; 2000-2002 Julian Seward
<p>
Valgrind is licensed under the GNU General Public License,
version 2<br>
An open-source tool for finding memory-management problems in
x86 GNU/Linux executables.
</center>

<p>

<hr width="100%">

<h2>Introduction</h2>

This document contains a detailed, highly technical description of the
internals of Valgrind.  This is not the user manual; if you are an
end-user of Valgrind, you do not want to read this.  Conversely, if
you really are a hacker-type and want to know how it works, I assume
that you have read the user manual thoroughly.
<p>
You may need to read this document several times, and carefully.  Some
important things I say only once.

<h3>History</h3>

Valgrind came into public view in late Feb 2002.  However, it has been
under contemplation for a very long time, perhaps seriously for about
five years.  Somewhat over two years ago, I started working on the x86
code generator for the Glasgow Haskell Compiler
(http://www.haskell.org/ghc), gaining familiarity with x86 internals
on the way.  I then did Cacheprof (http://www.cacheprof.org), gaining
further x86 experience.  Some time around Feb 2000 I started
experimenting with a user-space x86 interpreter for x86-Linux.  This
worked, but it was clear that a JIT-based scheme would be necessary to
give reasonable performance for Valgrind.  Design work for the JITter
started in earnest in Oct 2000, and by early 2001 I had an x86-to-x86
dynamic translator which could run quite large programs.  This
translator was in a sense pointless, since it did not do any
instrumentation or checking.

<p>
Most of the rest of 2001 was taken up designing and implementing the
instrumentation scheme.  The main difficulty, which consumed a lot
of effort, was to design a scheme which did not generate large numbers
of false uninitialised-value warnings.  By late 2001 a satisfactory
scheme had been arrived at, and I started to test it on ever-larger
programs, with an eventual eye to making it work well enough that
it was helpful to folks debugging the upcoming version 3 of KDE.  I've
used KDE since before version 1.0, and wanted Valgrind to be an
indirect contribution to the KDE 3 development effort.  At the start of
Feb 02 the kde-core-devel crew started using it, and gave a huge
amount of helpful feedback and patches in the space of three weeks.
Snapshot 20020306 is the result.

<p>
In the best Unix tradition, or perhaps in the spirit of Fred Brooks'
depressing-but-completely-accurate dictum "build one to throw away;
you will anyway", much of Valgrind is a second or third rendition of
the initial idea.  The instrumentation machinery
(<code>vg_translate.c</code>, <code>vg_memory.c</code>) and core CPU
simulation (<code>vg_to_ucode.c</code>, <code>vg_from_ucode.c</code>)
have had three redesigns and rewrites; the register allocator,
low-level memory manager (<code>vg_malloc2.c</code>) and symbol table
reader (<code>vg_symtab2.c</code>) are on their second rewrite.  In a
sense, this document serves to record some of the knowledge gained as
a result.


<h3>Design overview</h3>

Valgrind is compiled into a Linux shared object,
<code>valgrind.so</code>, and also a dummy one,
<code>valgrinq.so</code>, of which more later.  The
<code>valgrind</code> shell script adds <code>valgrind.so</code> to
the <code>LD_PRELOAD</code> list of extra libraries to be
loaded with any dynamically linked program.  This is a standard trick,
one which I assume the <code>LD_PRELOAD</code> mechanism was developed
to support.

<p>
<code>valgrind.so</code>
is linked with the <code>-z initfirst</code> flag, which requests that
its initialisation code is run before that of any other object in the
executable image.  When this happens, Valgrind gains control.  The
real CPU becomes "trapped" in <code>valgrind.so</code> and the
translations it generates.  The synthetic CPU provided by Valgrind
does, however, return from this initialisation function.  So the
normal startup actions, orchestrated by the dynamic linker
<code>ld.so</code>, continue as usual, except on the synthetic CPU,
not the real one.  Eventually <code>main</code> is run and returns,
and then the finalisation code of the shared objects is run,
presumably in the inverse of the order in which they were initialised.
Remember, this is still all happening on the simulated CPU.  Eventually
<code>valgrind.so</code>'s own finalisation code is called.  It spots
this event, shuts down the simulated CPU, prints any error summaries
and/or does leak detection, and returns from the initialisation code
on the real CPU.  At this point, in effect the real and synthetic CPUs
have merged back into one, Valgrind has lost control of the program,
and the program finally <code>exit()s</code> back to the kernel in the
usual way.
<p>
The normal course of activity, once Valgrind has started up, is as
follows.  Valgrind never runs any part of your program (usually
referred to as the "client"), not a single byte of it, directly.
Instead it uses function <code>VG_(translate)</code> to translate
basic blocks (BBs, straight-line sequences of code) into instrumented
translations, and those are run instead.  The translations are stored
in the translation cache (TC), <code>vg_tc</code>, with the
translation table (TT), <code>vg_tt</code>, supplying the
original-to-translation code address mapping.  Auxiliary array
<code>VG_(tt_fast)</code> is used as a direct-map cache for fast
lookups in TT; it usually achieves a hit rate of around 98% and
facilitates an orig-to-trans lookup in 4 x86 insns, which is not bad.
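<p>
The fast lookup can be pictured as a one-level direct-mapped cache
sitting in front of the full TT search.  The C sketch below is
illustrative only: the table size, the index function and the helper
names (<code>lookup</code>, <code>search_transtab</code>) are
assumptions for the example, not Valgrind's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical VG_(tt_fast)-style direct-map cache: each slot caches
   one (orig addr, trans addr) pair, indexed by low-order bits of the
   original address.  Size and hash are made up for illustration. */
#define FAST_CACHE_BITS 12
#define FAST_CACHE_SIZE (1 << FAST_CACHE_BITS)

typedef struct { uint32_t orig; uint32_t trans; } FastEntry;
static FastEntry fast_cache[FAST_CACHE_SIZE];   /* zero-initialised */

/* Stand-in for the slow full-table search, VG_(search_transtab). */
static uint32_t search_transtab(uint32_t orig) { return orig ^ 0xC0DE0000u; }

static uint32_t lookup(uint32_t orig)
{
   FastEntry* e = &fast_cache[orig & (FAST_CACHE_SIZE - 1)];
   if (e->orig == orig && e->trans != 0)
      return e->trans;                 /* hit: a handful of instructions */
   e->orig  = orig;                    /* miss: fill from the full table */
   e->trans = search_transtab(orig);
   return e->trans;
}
```

The second lookup of the same address hits the cache, which is what
makes the common case only a few instructions long.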

<p>
Function <code>VG_(dispatch)</code> in <code>vg_dispatch.S</code> is
the heart of the JIT dispatcher.  Once a translated code address has
been found, it is executed simply by an x86 <code>call</code>
to the translation.  At the end of the translation, the next
original code addr is loaded into <code>%eax</code>, and the
translation then does a <code>ret</code>, taking it back to the
dispatch loop, with, interestingly, zero branch mispredictions.
The address requested in <code>%eax</code> is looked up first in
<code>VG_(tt_fast)</code>, and, if not found, by calling C helper
<code>VG_(search_transtab)</code>.  If there is still no translation
available, <code>VG_(dispatch)</code> exits back to the top-level
C dispatcher <code>VG_(toploop)</code>, which arranges for
<code>VG_(translate)</code> to make a new translation.  All fairly
unsurprising, really.  There are various complexities described below.
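<p>
In C terms, the dispatch loop amounts to: find the translation for the
next original address, run it, and treat its "return value" (really
<code>%eax</code> at the <code>ret</code>) as the next original
address.  A deliberately hand-waving model, with made-up addresses and
names:

```c
#include <assert.h>
#include <stdint.h>

#define STOP 0u   /* sentinel: "client wants to exit" */

/* Two fake translations: the BB at 0x1000 jumps to 0x2000, which exits.
   A real translation is generated machine code, not a C function. */
static uint32_t bb_at_0x1000(void) { return 0x2000u; }
static uint32_t bb_at_0x2000(void) { return STOP; }

typedef uint32_t (*Trans)(void);

/* Stand-in for the tt_fast / VG_(search_transtab) / VG_(translate)
   chain: map an original address to its translation. */
static Trans find_translation(uint32_t addr)
{
   switch (addr) {
      case 0x1000u: return bb_at_0x1000;
      case 0x2000u: return bb_at_0x2000;
      default:      return 0;
   }
}

/* The dispatch loop itself; returns how many BBs were run. */
static int run(uint32_t entry)
{
   int bbs_run = 0;
   for (uint32_t next = entry; next != STOP; bbs_run++)
      next = find_translation(next)();   /* "call ...; ret" cycle */
   return bbs_run;
}
```

In the real thing the loop body is a few assembly instructions, and the
call/ret pairing is what keeps the branch predictor perfectly happy.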

<p>
The translator, orchestrated by <code>VG_(translate)</code>, is
complicated but entirely self-contained.  It is described in great
detail in subsequent sections.  Translations are stored in TC, with TT
tracking administrative information.  The translations are subject to
an approximate LRU-based management scheme.  With the current
settings, the TC can hold at most about 15MB of translations, and LRU
passes prune it to about 13.5MB.  Given that the
orig-to-translation expansion ratio is about 13:1 to 14:1, this means
TC holds translations for more or less a megabyte of original code,
which generally comes to about 70000 basic blocks for C++ compiled
with optimisation on.  Generating new translations is expensive, so it
is worth having a large TC to minimise the (capacity) miss rate.

<p>
The dispatcher, <code>VG_(dispatch)</code>, receives hints from
the translations which allow it to cheaply spot all control
transfers corresponding to x86 <code>call</code> and <code>ret</code>
instructions.  It has to do this in order to spot some special events:
<ul>
<li>Calls to <code>VG_(shutdown)</code>.  This is Valgrind's cue to
    exit.  NOTE: actually this is done a different way; it should be
    cleaned up.
    <p>
<li>Returns of system call handlers, to the return address
    <code>VG_(signalreturn_bogusRA)</code>.  The signal simulator
    needs to know when a signal handler is returning, so we spot
    jumps (returns) to this address.
    <p>
<li>Calls to <code>vg_trap_here</code>.  All <code>malloc</code>,
    <code>free</code>, etc calls that the client program makes are
    eventually routed to a call to <code>vg_trap_here</code>,
    and Valgrind does its own special thing with these calls.
    In effect this provides a trapdoor, by which Valgrind can
    intercept certain calls on the simulated CPU, run the call as it
    sees fit itself (on the real CPU), and return the result to
    the simulated CPU, quite transparently to the client program.
</ul>
Valgrind intercepts the client's <code>malloc</code>,
<code>free</code>, etc, calls, so that it can store additional
information.  Each block <code>malloc</code>'d by the client gives
rise to a shadow block in which Valgrind stores the call stack at the
time of the <code>malloc</code> call.  When the client calls
<code>free</code>, Valgrind tries to find the shadow block
corresponding to the address passed to <code>free</code>, and emits an
error message if none can be found.  If it is found, the block is
placed on the freed blocks queue <code>vg_freed_list</code>, it is
marked as inaccessible, and its shadow block now records the call
stack at the time of the <code>free</code> call.  Keeping
<code>free</code>'d blocks in this queue allows Valgrind to spot all
(presumably invalid) accesses to them.  However, once the volume of
blocks in the freed queue exceeds <code>VG_(clo_freelist_vol)</code>,
blocks are finally removed from the queue.
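<p>
The freed-blocks queue is a FIFO with a cap on its total byte volume:
new blocks go on one end, and the oldest blocks are discarded once the
total exceeds the limit.  The sketch below shows only this queueing
discipline; the <code>Shadow</code> layout, helper names and the limit
value are assumptions, not the <code>vg_clientmalloc.c</code> code.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of a vg_freed_list-style deferred-free queue. */
typedef struct Shadow { size_t size; struct Shadow* next; } Shadow;

static Shadow* freed_head = NULL;          /* oldest queued block */
static Shadow* freed_tail = NULL;          /* newest queued block */
static size_t  freed_vol  = 0;             /* total bytes queued */
static size_t  freelist_vol_limit = 1000;  /* assumed --freelist-vol */

/* Queue a freed block; returns how many old blocks were finally
   discarded (really: handed back to the low-level allocator). */
static size_t add_freed(Shadow* s)
{
   size_t discarded = 0;
   s->next = NULL;
   if (freed_tail) freed_tail->next = s; else freed_head = s;
   freed_tail = s;
   freed_vol += s->size;
   /* Evict oldest blocks until we are back under the limit. */
   while (freed_vol > freelist_vol_limit && freed_head) {
      Shadow* old = freed_head;
      freed_head = old->next;
      if (!freed_head) freed_tail = NULL;
      freed_vol -= old->size;
      discarded++;
   }
   return discarded;
}
```

While a block sits in the queue it stays marked inaccessible, so any
client access to it is reported; only eviction makes the memory truly
reusable.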

<p>
Keeping track of A and V bits (note: if you don't know what these are,
you haven't read the user guide carefully enough) for memory is done
in <code>vg_memory.c</code>.  This implements a sparse array structure
which covers the entire 4G address space in a way which is reasonably
fast and reasonably space efficient.  The 4G address space is divided
up into 64K sections, each covering 64KB of address space.  Given a
32-bit address, the top 16 bits are used to select one of the 65536
entries in <code>VG_(primary_map)</code>.  The resulting "secondary"
(<code>SecMap</code>) holds the A and V bits for the 64KB chunk of
address space corresponding to the lower 16 bits of the address.
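<p>
The address split can be sketched as follows.  For clarity this toy
version stores one whole byte per A bit and allocates secondaries
lazily; the real <code>SecMap</code> packs A and V bits far more
tightly, so take this as the addressing scheme only, not the storage
scheme.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy two-level map: top 16 address bits pick a secondary, bottom
   16 bits index within it.  One byte per A bit for readability. */
typedef struct { uint8_t abits[65536]; } SecMap;

static SecMap* primary_map[65536];   /* cf. VG_(primary_map) */

static SecMap* get_secmap(uint32_t addr)
{
   uint32_t hi = addr >> 16;             /* primary index */
   if (!primary_map[hi]) {
      primary_map[hi] = malloc(sizeof(SecMap));
      memset(primary_map[hi], 0, sizeof(SecMap));  /* all inaccessible */
   }
   return primary_map[hi];
}

static void set_accessible(uint32_t addr)
{
   get_secmap(addr)->abits[addr & 0xFFFF] = 1;
}

static int is_accessible(uint32_t addr)
{
   return get_secmap(addr)->abits[addr & 0xFFFF];
}
```

The scheme is sparse because untouched 64KB chunks never get a
secondary allocated, yet a query is still just two array indexings.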


<h3>Design decisions</h3>

Some design decisions were motivated by the need to make Valgrind
debuggable.  Imagine you are writing a CPU simulator.  It works fairly
well.  However, you run some large program, like Netscape, and after
tens of millions of instructions, it crashes.  How can you figure out
where in your simulator the bug is?

<p>
Valgrind's answer is: cheat.  Valgrind is designed so that it is
possible to switch back to running the client program on the real
CPU at any point.  Using the <code>--stop-after=</code> flag, you can
ask Valgrind to run just some number of basic blocks, and then
run the rest of the way on the real CPU.  If you are searching for
a bug in the simulated CPU, you can use this to do a binary search,
which quickly leads you to the specific basic block which is
causing the problem.
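<p>
The binary search works because "does it misbehave?" is monotonic in
the <code>--stop-after</code> value: once enough blocks are simulated
to include the buggy one, the failure appears, and for any smaller
count it does not.  A model of the search (the buggy-BB number and the
predicate are of course made up):

```c
#include <assert.h>
#include <stdint.h>

/* Pretend BB number 123456 is the first one the simulator gets wrong;
   the person doing the bisection does not know this number. */
static uint32_t buggy_bb = 123456;

/* Model of "run with --stop-after=n and see if it misbehaves":
   fails exactly when the buggy BB is among the n simulated blocks. */
static int misbehaves_with(uint32_t stop_after)
{
   return stop_after >= buggy_bb;
}

/* Classic bisection.  Precondition: !misbehaves_with(lo) and
   misbehaves_with(hi).  Returns the smallest failing count. */
static uint32_t bisect(uint32_t lo, uint32_t hi)
{
   while (hi - lo > 1) {
      uint32_t mid = lo + (hi - lo) / 2;
      if (misbehaves_with(mid)) hi = mid; else lo = mid;
   }
   return hi;
}
```

Each probe halves the interval, so even tens of millions of basic
blocks need only a few dozen runs to pin down the culprit.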

<p>
This is all very handy.  It does constrain the design in certain
unimportant ways.  Firstly, the layout of memory, when viewed from the
client's point of view, must be identical regardless of whether it is
running on the real or simulated CPU.  This means that Valgrind can't
do pointer swizzling -- well, no great loss -- and it can't run on
the same stack as the client -- again, no great loss.
Valgrind operates on its own stack, <code>VG_(stack)</code>, which
it switches to at startup, temporarily switching back to the client's
stack when doing system calls for the client.

<p>
Valgrind also receives signals on its own stack,
<code>VG_(sigstack)</code>, but for different gruesome reasons
discussed below.

<p>
This nice clean switch-back-to-the-real-CPU-whenever-you-like story
is muddied by signals.  The problem is that signals arrive at arbitrary
times and tend to slightly perturb the basic block count, with the
result that you can get close to the basic block causing a problem but
can't home in on it exactly.  My kludgey hack is to define
<code>SIGNAL_SIMULATION</code> to 1 towards the bottom of
<code>vg_syscall_mem.c</code>, so that signal handlers are run on the
real CPU and don't change the BB counts.

<p>
A second hole in the switch-back-to-real-CPU story is that Valgrind's
way of delivering signals to the client is different from that of the
kernel.  Specifically, the layout of the signal delivery frame, and
the mechanism used to detect a sighandler returning, are different.
So you can't expect to make the transition inside a sighandler and
still have things working, but in practice that's not much of a
restriction.

<p>
Valgrind's implementation of <code>malloc</code>, <code>free</code>,
etc (in <code>vg_clientmalloc.c</code>, not the low-level stuff in
<code>vg_malloc2.c</code>) is somewhat complicated by the need to
handle switching back at arbitrary points.  It does work, though.


<h3>Correctness</h3>

There's only one of me, and I have a Real Life (tm) as well as hacking
Valgrind [allegedly :-].  That means I don't have time to waste
chasing endless bugs in Valgrind.  My emphasis is therefore on doing
everything as simply as possible, with correctness, stability and
robustness being the number one priority, more important than
performance or functionality.  As a result:
<ul>
<li>The code is absolutely loaded with assertions, and these are
    <b>permanently enabled.</b>  I have no plan to remove or disable
    them later.  Over the past couple of months, as Valgrind has
    become more widely used, they have shown their worth, pulling
    up various bugs which would otherwise have appeared as
    hard-to-find segmentation faults.
    <p>
    I am of the view that it's acceptable to spend 5% of the total
    running time of your valgrindified program doing assertion checks
    and other internal sanity checks.
    <p>
<li>Aside from the assertions, Valgrind contains various sets of
    internal sanity checks, which get run at varying frequencies
    during normal operation.  <code>VG_(do_sanity_checks)</code>
    runs every 1000 basic blocks, which means 500 to 2000 times/second
    for typical machines at present.  It checks that Valgrind hasn't
    overrun its private stack, and does some simple checks on the
    memory permissions maps.  Once every 25 calls it does some more
    extensive checks on those maps.  Etc, etc.
    <p>
    The following components also have sanity check code, which can
    be enabled to aid debugging:
    <ul>
    <li>The low-level memory manager
        (<code>VG_(mallocSanityCheckArena)</code>).  This does a
        complete check of all blocks and chains in an arena, which
        is very slow.  Not engaged by default.
        <p>
    <li>The symbol table reader(s): various checks to ensure
        uniqueness of mappings; see <code>VG_(read_symbols)</code>
        for a start.  Permanently engaged.
        <p>
    <li>The A and V bit tracking stuff in <code>vg_memory.c</code>.
        This can be compiled with the cpp symbol
        <code>VG_DEBUG_MEMORY</code> defined, which removes all the
        fast, optimised cases, and uses simple-but-slow fallbacks
        instead.  Not engaged by default.
        <p>
    <li>Ditto <code>VG_DEBUG_LEAKCHECK</code>.
        <p>
    <li>The JITter parses x86 basic blocks into sequences of
        UCode instructions.  It then sanity checks each one with
        <code>VG_(saneUInstr)</code> and sanity checks the sequence
        as a whole with <code>VG_(saneUCodeBlock)</code>.  This stuff
        is engaged by default, and has caught some way-obscure bugs
        in the simulated CPU machinery in its time.
        <p>
    <li>The system call wrapper does
        <code>VG_(first_and_last_secondaries_look_plausible)</code> after
        every syscall; this is known to pick up bugs in the syscall
        wrappers.  Engaged by default.
        <p>
    <li>The main dispatch loop, in <code>VG_(dispatch)</code>, checks
        that translations do not set <code>%ebp</code> to any value
        different from <code>VG_EBP_DISPATCH_CHECKED</code> or
        <code>&amp;VG_(baseBlock)</code>.  In effect this test is free,
        and is permanently engaged.
        <p>
    <li>There are a couple of ifdefed-out consistency checks I
        inserted whilst debugging the new register allocator,
        <code>vg_do_register_allocation</code>.
    </ul>
    <p>
<li>I try to avoid techniques, algorithms, mechanisms, etc, for which
    I can supply neither a convincing argument that they are correct,
    nor sanity-check code which might pick up bugs in my
    implementation.  I don't always succeed in this, but I try.
    Basically the idea is: avoid techniques which are, in practice,
    unverifiable, in some sense.  When doing anything, always have in
    mind: "how can I verify that this is correct?"
</ul>

<p>
Some more specific things are:

<ul>
<li>Valgrind runs in the same namespace as the client, at least from
    <code>ld.so</code>'s point of view, and it therefore absolutely
    had better not export any symbol with a name which could clash
    with that of the client or any of its libraries.  Therefore, all
    globally visible symbols exported from <code>valgrind.so</code>
    are defined using the <code>VG_</code> CPP macro.  As you'll see
    from <code>vg_constants.h</code>, this attaches some arbitrary
    prefix to the symbol, in order that it be, we hope, globally
    unique.  Currently the prefix is <code>vgPlain_</code>.  For
    convenience there are also <code>VGM_</code>, <code>VGP_</code>
    and <code>VGOFF_</code>.  All locally defined symbols are declared
    <code>static</code> and do not appear in the final shared object.
    <p>
    To check this, I periodically do
    <code>nm valgrind.so | grep " T "</code>,
    which shows you all the globally exported text symbols.
    They should all have an approved prefix, except for those like
    <code>malloc</code>, <code>free</code>, etc, which we deliberately
    want to shadow and take precedence over the same names exported
    from <code>glibc.so</code>, so that Valgrind can intercept those
    calls easily.  Similarly, <code>nm valgrind.so | grep " D "</code>
    allows you to find any rogue data-segment symbol names.
    <p>
<li>Valgrind tries, and almost succeeds, in being completely
    independent of all other shared objects, in particular of
    <code>glibc.so</code>.  For example, we have our own low-level
    memory manager in <code>vg_malloc2.c</code>, which is a fairly
    standard malloc/free scheme augmented with arenas, and
    <code>vg_mylibc.c</code> exports reimplementations of various bits
    and pieces you'd normally get from the C library.
    <p>
    Why all the hassle?  Because imagine the potential chaos of both
    the simulated and real CPUs executing in <code>glibc.so</code>.
    It just seems simpler and cleaner to be completely self-contained,
    so that only the simulated CPU visits <code>glibc.so</code>.  In
    practice it's not much hassle anyway.  Also, Valgrind starts up
    before glibc has a chance to initialise itself, and who knows what
    difficulties that could lead to.  Finally, glibc has definitions
    for some types, specifically <code>sigset_t</code>, which conflict
    with (are different from) the Linux kernel's idea of the same.  When
    Valgrind wants to fiddle around with signal stuff, it wants to
    use the kernel's definitions, not glibc's definitions.  So it's
    simplest just to keep glibc out of the picture entirely.
    <p>
    To find out which glibc symbols are used by Valgrind, reinstate
    the link flags <code>-nostdlib -Wl,-no-undefined</code>.  This
    causes linking to fail, but will tell you what you depend on.
    I have mostly, but not entirely, got rid of the glibc
    dependencies; what remains is, IMO, fairly harmless.  AFAIK the
    current dependencies are: <code>memset</code>,
    <code>memcmp</code>, <code>stat</code>, <code>system</code>,
    <code>sbrk</code>, <code>setjmp</code> and <code>longjmp</code>.

    <p>
<li>Similarly, Valgrind should not really import any headers other
    than the Linux kernel headers, since it knows of no API other than
    the kernel interface to talk to.  At the moment this is really not
    in a good state, and <code>vg_syscall_mem</code> imports, via
    <code>vg_unsafe.h</code>, a significant number of C-library
    headers so as to know the sizes of various structs passed across
    the kernel boundary.  This is of course completely bogus, since
    there is no guarantee that the C library's definitions of these
    structs match those of the kernel.  I have started to sort this
    out using <code>vg_kerneliface.h</code>, into which I had intended
    to copy all kernel definitions which Valgrind could need, but this
    has not gotten very far.  At the moment it mostly contains
    definitions for <code>sigset_t</code> and <code>struct
    sigaction</code>, since the kernel's definitions of these really
    do clash with glibc's.  I plan to use a <code>vki_</code> prefix
    on all these types and constants, to denote the fact that they
    pertain to <b>V</b>algrind's <b>K</b>ernel <b>I</b>nterface.
    <p>
    Another advantage of having a <code>vg_kerneliface.h</code> file
    is that it makes it simpler to interface to a different kernel.
    One can, for example, easily imagine writing a new
    <code>vg_kerneliface.h</code> for FreeBSD, or x86 NetBSD.

</ul>
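<p>
The <code>VG_</code> naming macro mentioned above is ordinary
preprocessor token pasting.  This is the general shape of the trick;
the exact definition in <code>vg_constants.h</code> may differ in
detail, and <code>vgPlain_answer</code> here is a made-up example
symbol.

```c
#include <assert.h>

/* Sketch of the naming trick: VG_(foo) becomes vgPlain_foo at the
   symbol level, so every export shares one unlikely-to-clash prefix. */
#define VGAPPEND(a,b) a##b
#define VG_(name) VGAPPEND(vgPlain_, name)

/* Defined through the macro; the linker only ever sees the mangled
   name vgPlain_answer, never a bare "answer". */
int VG_(answer)(void) { return 42; }
```

Code inside Valgrind writes <code>VG_(answer)()</code>; an `nm` of the
object shows only the prefixed form.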

<h3>Current limitations</h3>

No threads.  I think fixing this is close to a research-grade problem.
<p>
No MMX.  Fixing this should be relatively easy, using the same giant
trick used for x86 FPU instructions.  See below.
<p>
Support for weird (non-POSIX) signal stuff is patchy.  Does anybody
care?
<p>

<hr width="100%">

<h2>The instrumenting JITter</h2>

This really is the heart of the matter.  We begin with various side
issues.
<h3>Run-time storage, and the use of host registers</h3>

Valgrind translates client (original) basic blocks into instrumented
basic blocks, which live in the translation cache TC, until either the
client finishes or the translations are ejected from TC to make room
for newer ones.
<p>
Since it generates x86 code in memory, Valgrind has complete control
of the use of registers in the translations.  Now pay attention.  I
shall say this only once, and it is important you understand this.  In
what follows I will refer to registers in the host (real) CPU using
their standard names, <code>%eax</code>, <code>%edi</code>, etc.  I
refer to registers in the simulated CPU by capitalising them:
<code>%EAX</code>, <code>%EDI</code>, etc.  These two sets of
registers usually bear no direct relationship to each other; there is
no fixed mapping between them.  This naming scheme is used fairly
consistently in the comments in the sources.
<p>
Host registers, once things are up and running, are used as follows:
<ul>
<li><code>%esp</code>, the real stack pointer, points
    somewhere in Valgrind's private stack area,
    <code>VG_(stack)</code> or, transiently, into its signal delivery
    stack, <code>VG_(sigstack)</code>.
    <p>
<li><code>%edi</code> is used as a temporary in code generation; it
    is almost always dead, except when used for the <code>Left</code>
    value-tag operations.
    <p>
<li><code>%eax</code>, <code>%ebx</code>, <code>%ecx</code>,
    <code>%edx</code> and <code>%esi</code> are available to
    Valgrind's register allocator.  They are dead (carry unimportant
    values) in between translations, and are live only in
    translations.  The one exception to this is <code>%eax</code>,
    which, as mentioned far above, has a special significance to the
    dispatch loop <code>VG_(dispatch)</code>: when a translation
    returns to the dispatch loop, <code>%eax</code> is expected to
    contain the original-code-address of the next translation to run.
    The register allocator is so good at minimising spill code that
    using five regs and not having to save/restore <code>%edi</code>
    actually gives better code than allocating to <code>%edi</code>
    as well, but then having to push/pop it around special uses.
    <p>
<li><code>%ebp</code> points permanently at
    <code>VG_(baseBlock)</code>.  Valgrind's translations are
    position-independent, partly because this is convenient, but also
    because translations get moved around in TC as part of the LRUing
    activity.  <b>All</b> static entities which need to be referred to
    from generated code, whether data or helper functions, are stored
    starting at <code>VG_(baseBlock)</code> and are therefore reached
    by indexing from <code>%ebp</code>.  There is but one exception,
    which is that by placing the value
    <code>VG_EBP_DISPATCH_CHECKED</code>
    in <code>%ebp</code> just before a return to the dispatcher,
    the dispatcher is informed that the next address to run,
    in <code>%eax</code>, requires special treatment.
    <p>
<li>The real machine's FPU state is pretty much unimportant, for
    reasons which will become obvious.  Ditto its <code>%eflags</code>
    register.
</ul>

<p>
The state of the simulated CPU is stored in memory, in
<code>VG_(baseBlock)</code>, which is a block of 200 words IIRC.
Recall that <code>%ebp</code> points permanently at the start of this
block.  Function <code>vg_init_baseBlock</code> decides what the
offsets of various entities in <code>VG_(baseBlock)</code> are to be,
and allocates word offsets for them.  The code generator then emits
<code>%ebp</code>-relative addresses to get at those things.  The
sequence in which entities are allocated has been carefully chosen so
that the 32 most popular entities come first, because this means 8-bit
offsets can be used in the generated code.
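<p>
Why exactly 32 words get the short form: an x86 disp8 is a signed
byte, so <code>%ebp</code>-relative offsets from -128 to +127 take the
one-byte encoding.  Word <code>i</code> lives at byte offset
<code>4*i</code>, so words 0..31 (offsets 0..124) fit, and word 32
(offset 128) already needs a four-byte displacement.  A trivial check
(the helper name is mine, not Valgrind's):

```c
#include <assert.h>

/* Does the i'th word of VG_(baseBlock), at byte offset 4*i from
   %ebp, fit in a signed 8-bit x86 displacement (-128..127)? */
static int fits_disp8(int word_index)
{
   int byte_off = 4 * word_index;
   return byte_off >= -128 && byte_off <= 127;
}
```

Hence the care taken below to put the hottest entities in the first 32
word slots.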
| 561 | |
| 562 | <p> |
| 563 | If I was clever, I could make <code>%ebp</code> point 32 words along |
| 564 | <code>VG_(baseBlock)</code>, so that I'd have another 32 words of |
| 565 | short-form offsets available, but that's just complicated, and it's |
| 566 | not important -- the first 32 words take 99% (or whatever) of the |
| 567 | traffic. |
| 568 | |
| 569 | <p> |
| 570 | Currently, the sequence of stuff in <code>VG_(baseBlock)</code> is as |
| 571 | follows: |
| 572 | <ul> |
| 573 | <li>9 words, holding the simulated integer registers, |
| 574 | <code>%EAX</code> .. <code>%EDI</code>, and the simulated flags, |
| 575 | <code>%EFLAGS</code>. |
| 576 | <p> |
| 577 | <li>Another 9 words, holding the V bit "shadows" for the above 9 regs. |
| 578 | <p> |
| 579 | <li>The <b>addresses</b> of various helper routines called from |
| 580 | generated code: |
| 581 | <code>VG_(helper_value_check4_fail)</code>, |
| 582 | <code>VG_(helper_value_check0_fail)</code>, |
| 583 | which register V-check failures, |
| 584 | <code>VG_(helperc_STOREV4)</code>, |
| 585 | <code>VG_(helperc_STOREV1)</code>, |
| 586 | <code>VG_(helperc_LOADV4)</code>, |
| 587 | <code>VG_(helperc_LOADV1)</code>, |
| 588 | which do stores and loads of V bits to/from the |
| 589 | sparse array which keeps track of V bits in memory, |
| 590 | and |
| 591 | <code>VGM_(handle_esp_assignment)</code>, which messes with |
| 592 | memory addressability resulting from changes in <code>%ESP</code>.
| 593 | <p> |
| 594 | <li>The simulated <code>%EIP</code>. |
| 595 | <p> |
| 596 | <li>24 spill words, for when the register allocator can't make it work |
| 597 | with 5 measly registers. |
| 598 | <p> |
| 599 | <li>Addresses of helpers <code>VG_(helperc_STOREV2)</code>, |
| 600 | <code>VG_(helperc_LOADV2)</code>. These are here because 2-byte |
| 601 | loads and stores are relatively rare, so are placed above the |
| 602 | magic 32-word offset boundary. |
| 603 | <p> |
| 604 | <li>For similar reasons, addresses of helper functions |
| 605 | <code>VGM_(fpu_write_check)</code> and |
| 606 | <code>VGM_(fpu_read_check)</code>, which handle the A/V maps |
| 607 | testing and changes required by FPU writes/reads. |
| 608 | <p> |
| 609 | <li>Some other boring helper addresses: |
| 610 | <code>VG_(helper_value_check2_fail)</code> and |
| 611 | <code>VG_(helper_value_check1_fail)</code>. These are probably |
| 612 | never emitted now, and should be removed. |
| 613 | <p> |
| 614 | <li>The entire state of the simulated FPU, which I believe to be |
| 615 | 108 bytes long. |
| 616 | <p> |
| 617 | <li>Finally, the addresses of various other helper functions in |
| 618 | <code>vg_helpers.S</code>, which deal with rare situations which |
| 619 | are tedious or difficult to generate code in-line for. |
| 620 | </ul> |
| 621 | |
| 622 | <p> |
| 623 | As a general rule, the simulated machine's state lives permanently in |
| 624 | memory at <code>VG_(baseBlock)</code>. However, the JITter does some |
| 625 | optimisations which allow the simulated integer registers to be |
| 626 | cached in real registers over multiple simulated instructions within |
| 627 | the same basic block. These are always flushed back into memory at |
| 628 | the end of every basic block, so that the in-memory state is |
| 629 | up-to-date between basic blocks. (This flushing is implied by the |
| 630 | statement above that the real machine's allocatable registers are |
| 631 | dead in between simulated blocks). |
| 632 | |
| 633 | |
| 634 | <h3>Startup, shutdown, and system calls</h3> |
| 635 | |
| 636 | Getting into Valgrind (<code>VG_(startup)</code>, called from
| 637 | <code>valgrind.so</code>'s initialisation section) really means
| 638 | copying the real CPU's state into <code>VG_(baseBlock)</code>,
| 639 | installing our own stack pointer, etc, into the real CPU, and
| 640 | then starting up the JITter.  Exiting Valgrind involves copying the
| 641 | simulated state back to the real state.
| 642 | |
| 643 | <p> |
| 644 | Unfortunately, there's a complication at startup time.  The problem is
| 645 | that at the point where we need to take a snapshot of the real CPU's
| 646 | state, the offsets in <code>VG_(baseBlock)</code> are not set up yet, |
| 647 | because to do so would involve disrupting the real machine's state |
| 648 | significantly. The way round this is to dump the real machine's state |
| 649 | into a temporary, static block of memory, |
| 650 | <code>VG_(m_state_static)</code>. We can then set up the |
| 651 | <code>VG_(baseBlock)</code> offsets at our leisure, and copy into it |
| 652 | from <code>VG_(m_state_static)</code> at some convenient later time. |
| 653 | This copying is done by |
| 654 | <code>VG_(copy_m_state_static_to_baseBlock)</code>. |
| 655 | |
| 656 | <p> |
| 657 | On exit, the inverse transformation is (rather unnecessarily) used: |
| 658 | stuff in <code>VG_(baseBlock)</code> is copied to |
| 659 | <code>VG_(m_state_static)</code>, and the assembly stub then copies |
| 660 | from <code>VG_(m_state_static)</code> into the real machine registers. |
| 661 | |
| 662 | <p> |
| 663 | Doing system calls on behalf of the client (<code>vg_syscall.S</code>) |
| 664 | is something of a half-way house.  We have to make the world look
| 665 | sufficiently like what the client would normally see for the
| 666 | syscall actually to work properly, but we can't afford to lose
| 667 | control.  So the trick is to copy all of the client's state, <b>except
| 668 | its program counter</b>, into the real CPU, do the system call, and |
| 669 | copy the state back out. Note that the client's state includes its |
| 670 | stack pointer register, so one effect of this partial restoration is |
| 671 | to cause the system call to be run on the client's stack, as it should |
| 672 | be. |
| 673 | |
| 674 | <p> |
| 675 | As ever there are complications. We have to save some of our own state |
| 676 | somewhere when restoring the client's state into the CPU, so that we |
| 677 | can keep going sensibly afterwards. In fact the only thing which is |
| 678 | important is our own stack pointer, but for paranoia reasons I save |
| 679 | and restore our own FPU state as well, even though that's probably |
| 680 | pointless. |
| 681 | |
| 682 | <p> |
| 683 | The complication on the above complication is, that for horrible |
| 684 | reasons to do with signals, we may have to handle a second client |
| 685 | system call whilst the client is blocked inside some other system |
| 686 | call (unbelievable!).  That means there are two sets of places to
| 687 | dump Valgrind's stack pointer and FPU state across the syscall, |
| 688 | and we decide which to use by consulting |
| 689 | <code>VG_(syscall_depth)</code>, which is in turn maintained by |
| 690 | <code>VG_(wrap_syscall)</code>. |
| 691 | |
| 692 | |
| 693 | |
| 694 | <h3>Introduction to UCode</h3> |
| 695 | |
| 696 | UCode lies at the heart of the x86-to-x86 JITter. The basic premise |
| 697 | is that dealing with the x86 instruction set head-on is just too darn
| 698 | complicated, so we do the traditional compiler-writer's trick and |
| 699 | translate it into a simpler, easier-to-deal-with form. |
| 700 | |
| 701 | <p> |
| 702 | In normal operation, translation proceeds through six stages, |
| 703 | coordinated by <code>VG_(translate)</code>: |
| 704 | <ol> |
| 705 | <li>Parsing of an x86 basic block into a sequence of UCode |
| 706 | instructions (<code>VG_(disBB)</code>). |
| 707 | <p> |
| 708 | <li>UCode optimisation (<code>vg_improve</code>), with the aim of |
| 709 | caching simulated registers in real registers over multiple |
| 710 | simulated instructions, and removing redundant simulated |
| 711 | <code>%EFLAGS</code> saving/restoring. |
| 712 | <p> |
| 713 | <li>UCode instrumentation (<code>vg_instrument</code>), which adds |
| 714 | value and address checking code. |
| 715 | <p> |
| 716 | <li>Post-instrumentation cleanup (<code>vg_cleanup</code>), removing |
| 717 | redundant value-check computations. |
| 718 | <p> |
| 719 | <li>Register allocation (<code>vg_do_register_allocation</code>), |
| 720 | which, note, is done on UCode. |
| 721 | <p> |
| 722 | <li>Emission of final instrumented x86 code |
| 723 | (<code>VG_(emit_code)</code>). |
| 724 | </ol> |
| 725 | |
| 726 | <p> |
| 727 | Notice how steps 2, 3, 4 and 5 are simple UCode-to-UCode |
| 728 | transformation passes, all on straight-line blocks of UCode (type |
| 729 | <code>UCodeBlock</code>). Steps 2 and 4 are optimisation passes and |
| 730 | can be disabled for debugging purposes, with |
| 731 | <code>--optimise=no</code> and <code>--cleanup=no</code> respectively. |
| 732 | |
| 733 | <p> |
| 734 | Valgrind can also run in a no-instrumentation mode, given |
| 735 | <code>--instrument=no</code>. This is useful for debugging the JITter |
| 736 | quickly without having to deal with the complexity of the |
| 737 | instrumentation mechanism too. In this mode, steps 3 and 4 are |
| 738 | omitted. |
| 739 | |
| 740 | <p> |
| 741 | These flags combine, so that <code>--instrument=no</code> together with |
| 742 | <code>--optimise=no</code> means only steps 1, 5 and 6 are used. |
| 743 | <code>--single-step=yes</code> causes each x86 instruction to be |
| 744 | treated as a single basic block. The translations are terrible but |
| 745 | this is sometimes instructive. |
| 746 | |
| 747 | <p> |
| 748 | The <code>--stop-after=N</code> flag switches back to the real CPU |
| 749 | after <code>N</code> basic blocks. It also re-JITs the final basic |
| 750 | block executed and prints the debugging info resulting, so this |
| 751 | gives you a way to get a quick snapshot of how a basic block looks as |
| 752 | it passes through the six stages mentioned above. If you want to |
| 753 | see full information for every block translated (probably not, but |
| 754 | still ...) find, in <code>VG_(translate)</code>, the lines |
| 755 | <br><code> dis = True;</code> |
| 756 | <br><code> dis = debugging_translation;</code> |
| 757 | <br> |
| 758 | and comment out the second line. This will spew out debugging |
| 759 | junk faster than you can possibly imagine. |
| 760 | |
| 761 | |
| 762 | |
| 763 | <h3>UCode operand tags: type <code>Tag</code></h3> |
| 764 | |
| 765 | UCode is, more or less, a simple two-address RISC-like code. In |
| 766 | keeping with the x86 AT&T assembly syntax, generally speaking the |
| 767 | first operand is the source operand, and the second is the destination |
| 768 | operand, which is modified when the uinstr is notionally executed. |
| 769 | |
| 770 | <p> |
| 771 | UCode instructions have up to three operand fields, each of which has |
| 772 | a corresponding <code>Tag</code> describing it. Possible values for |
| 773 | the tag are: |
| 774 | |
| 775 | <ul> |
| 776 | <li><code>NoValue</code>: indicates that the field is not in use. |
| 777 | <p> |
| 778 | <li><code>Lit16</code>: the field contains a 16-bit literal. |
| 779 | <p> |
| 780 | <li><code>Literal</code>: the field denotes a 32-bit literal, whose |
| 781 | value is stored in the <code>lit32</code> field of the uinstr |
| 782 | itself. Since there is only one <code>lit32</code> for the whole |
| 783 | uinstr, only one operand field may contain this tag. |
| 784 | <p> |
| 785 | <li><code>SpillNo</code>: the field contains a spill slot number, in |
| 786 | the range 0 to 23 inclusive, denoting one of the spill slots |
| 787 | contained inside <code>VG_(baseBlock)</code>. Such tags only |
| 788 | exist after register allocation. |
| 789 | <p> |
| 790 | <li><code>RealReg</code>: the field contains a number in the range 0 |
| 791 | to 7 denoting an integer x86 ("real") register on the host. The |
| 792 | number is the Intel encoding for integer registers. Such tags |
| 793 | only exist after register allocation. |
| 794 | <p> |
| 795 | <li><code>ArchReg</code>: the field contains a number in the range 0 |
| 796 | to 7 denoting an integer x86 register on the simulated CPU. In |
| 797 | reality this means a reference to one of the first 8 words of |
| 798 | <code>VG_(baseBlock)</code>. Such tags can exist at any point in |
| 799 | the translation process. |
| 800 | <p> |
| 801 | <li>Last, but not least, <code>TempReg</code>. The field contains the |
| 802 | number of one of an infinite set of virtual (integer) |
| 803 | registers. <code>TempReg</code>s are used everywhere throughout |
| 804 | the translation process; you can have as many as you want. The |
| 805 | register allocator maps as many as it can into |
| 806 | <code>RealReg</code>s and turns the rest into |
| 807 | <code>SpillNo</code>s, so <code>TempReg</code>s should not exist |
| 808 | after the register allocation phase. |
| 809 | <p> |
| 810 | <code>TempReg</code>s are always 32 bits long, even if the data |
| 811 | they hold is logically shorter. In that case the upper unused |
| 812 | bits are required, and, I think, generally assumed, to be zero. |
| 813 | <code>TempReg</code>s holding V bits for quantities shorter than |
| 814 | 32 bits are expected to have ones in the unused places, since a |
| 815 | one denotes "undefined". |
| 816 | </ul> |
| 817 | |
| 818 | |
| 819 | <h3>UCode instructions: type <code>UInstr</code></h3> |
| 820 | |
| 821 | <p> |
| 822 | UCode was carefully designed to make it possible to do register |
| 823 | allocation on UCode and then translate the result into x86 code |
| 824 | without needing any extra registers ... well, that was the original |
| 825 | plan, anyway. Things have gotten a little more complicated since |
| 826 | then. In what follows, UCode instructions are referred to as uinstrs, |
| 827 | to distinguish them from x86 instructions. Uinstrs of course have |
| 828 | uopcodes which are (naturally) different from x86 opcodes. |
| 829 | |
| 830 | <p> |
| 831 | A uinstr (type <code>UInstr</code>) contains |
| 832 | various fields, not all of which are used by any one uopcode: |
| 833 | <ul> |
| 834 | <li>Three 16-bit operand fields, <code>val1</code>, <code>val2</code> |
| 835 | and <code>val3</code>. |
| 836 | <p> |
| 837 | <li>Three tag fields, <code>tag1</code>, <code>tag2</code> |
| 838 | and <code>tag3</code>. Each of these has a value of type |
| 839 | <code>Tag</code>, |
| 840 | and they describe what the <code>val1</code>, <code>val2</code> |
| 841 | and <code>val3</code> fields contain. |
| 842 | <p> |
| 843 | <li>A 32-bit literal field. |
| 844 | <p> |
| 845 | <li>Two <code>FlagSet</code>s, specifying which x86 condition codes are |
| 846 | read and written by the uinstr. |
| 847 | <p> |
| 848 | <li>An opcode byte, containing a value of type <code>Opcode</code>. |
| 849 | <p> |
| 850 | <li>A size field, indicating the data transfer size (1/2/4/8/10) in |
| 851 | cases where this makes sense, or zero otherwise. |
| 852 | <p> |
| 853 | <li>A condition-code field, which, for jumps, holds a |
| 854 | value of type <code>Condcode</code>, indicating the condition |
| 855 | which applies. The encoding is as it is in the x86 insn stream, |
| 856 | except we add a 17th value <code>CondAlways</code> to indicate |
| 857 | an unconditional transfer. |
| 858 | <p> |
| 859 | <li>Various 1-bit flags, indicating whether this insn pertains to an |
| 860 | x86 CALL or RET instruction, whether a widening is signed or not, |
| 861 | etc. |
| 862 | </ul> |
| 863 | |
| 864 | <p> |
| 865 | UOpcodes (type <code>Opcode</code>) are divided into two groups: those |
| 866 | necessary merely to express the functionality of the x86 code, and |
| 867 | extra uopcodes needed to express the instrumentation. The former |
| 868 | group contains: |
| 869 | <ul> |
| 870 | <li><code>GET</code> and <code>PUT</code>, which move values from the |
| 871 | simulated CPU's integer registers (<code>ArchReg</code>s) into |
| 872 | <code>TempReg</code>s, and back. <code>GETF</code> and |
| 873 | <code>PUTF</code> do the corresponding thing for the simulated |
| 874 | <code>%EFLAGS</code>. There are no corresponding insns for the |
| 875 | FPU register stack, since we don't explicitly simulate its |
| 876 | registers. |
| 877 | <p> |
| 878 | <li><code>LOAD</code> and <code>STORE</code>, which, in RISC-like |
| 879 | fashion, are the only uinstrs able to interact with memory. |
| 880 | <p> |
| 881 | <li><code>MOV</code> and <code>CMOV</code> allow unconditional and |
| 882 | conditional moves of values between <code>TempReg</code>s. |
| 883 | <p> |
| 884 | <li>ALU operations. Again in RISC-like fashion, these only operate on |
| 885 | <code>TempReg</code>s (before reg-alloc) or <code>RealReg</code>s |
| 886 | (after reg-alloc). These are: <code>ADD</code>, <code>ADC</code>, |
| 887 | <code>AND</code>, <code>OR</code>, <code>XOR</code>, |
| 888 | <code>SUB</code>, <code>SBB</code>, <code>SHL</code>, |
| 889 | <code>SHR</code>, <code>SAR</code>, <code>ROL</code>, |
| 890 | <code>ROR</code>, <code>RCL</code>, <code>RCR</code>, |
| 891 | <code>NOT</code>, <code>NEG</code>, <code>INC</code>, |
| 892 | <code>DEC</code>, <code>BSWAP</code>, <code>CC2VAL</code> and |
| 893 | <code>WIDEN</code>. <code>WIDEN</code> does signed or unsigned |
| 894 | value widening. <code>CC2VAL</code> is used to convert condition |
| 895 | codes into a value, zero or one. The rest are obvious. |
| 896 | <p> |
| 897 | To allow for more efficient code generation, we bend slightly the |
| 898 | restriction at the start of the previous para: for |
| 899 | <code>ADD</code>, <code>ADC</code>, <code>XOR</code>, |
| 900 | <code>SUB</code> and <code>SBB</code>, we allow the first (source) |
| 901 | operand to also be an <code>ArchReg</code>, that is, one of the |
| 902 | simulated machine's registers. Also, many of these ALU ops allow |
| 903 | the source operand to be a literal. See |
| 904 | <code>VG_(saneUInstr)</code> for the final word on the allowable |
| 905 | forms of uinstrs. |
| 906 | <p> |
| 907 | <li><code>LEA1</code> and <code>LEA2</code> are not strictly
| 908 |     necessary, but facilitate better translations.  They
| 909 | record the fancy x86 addressing modes in a direct way, which |
| 910 | allows those amodes to be emitted back into the final |
| 911 | instruction stream more or less verbatim. |
| 912 | <p> |
| 913 | <li><code>CALLM</code> calls a machine-code helper, one of the methods |
| 914 | whose address is stored at some <code>VG_(baseBlock)</code> |
| 915 | offset. <code>PUSH</code> and <code>POP</code> move values |
| 916 | to/from <code>TempReg</code> to the real (Valgrind's) stack, and |
| 917 | <code>CLEAR</code> removes values from the stack. |
| 918 | <code>CALLM_S</code> and <code>CALLM_E</code> delimit the |
| 919 | boundaries of call setups and clearings, for the benefit of the |
| 920 | instrumentation passes. Getting this right is critical, and so |
| 921 | <code>VG_(saneUCodeBlock)</code> makes various checks on the use |
| 922 | of these uopcodes. |
| 923 | <p> |
| 924 | It is important to understand that these uopcodes have nothing to |
| 925 |     do with the x86 <code>call</code>, <code>return</code>,
| 926 | <code>push</code> or <code>pop</code> instructions, and are not |
| 927 | used to implement them. Those guys turn into combinations of |
| 928 | <code>GET</code>, <code>PUT</code>, <code>LOAD</code>, |
| 929 | <code>STORE</code>, <code>ADD</code>, <code>SUB</code>, and |
| 930 | <code>JMP</code>. What these uopcodes support is calling of |
| 931 | helper functions such as <code>VG_(helper_imul_32_64)</code>, |
| 932 | which do stuff which is too difficult or tedious to emit inline. |
| 933 | <p> |
| 934 | <li><code>FPU</code>, <code>FPU_R</code> and <code>FPU_W</code>. |
| 935 | Valgrind doesn't attempt to simulate the internal state of the |
| 936 | FPU at all. Consequently it only needs to be able to distinguish |
| 937 | FPU ops which read and write memory from those that don't, and |
| 938 | for those which do, it needs to know the effective address and |
| 939 | data transfer size. This is made easier because the x86 FP |
| 940 | instruction encoding is very regular, basically consisting of |
| 941 | 16 bits for a non-memory FPU insn and 11 (IIRC) bits + an address mode |
| 942 | for a memory FPU insn. So our <code>FPU</code> uinstr carries |
| 943 | the 16 bits in its <code>val1</code> field. And |
| 944 | <code>FPU_R</code> and <code>FPU_W</code> carry 11 bits in that |
| 945 | field, together with the identity of a <code>TempReg</code> or |
| 946 | (later) <code>RealReg</code> which contains the address. |
| 947 | <p> |
| 948 | <li><code>JIFZ</code> is unique, in that it allows a control-flow |
| 949 | transfer which is not deemed to end a basic block. It causes a |
| 950 | jump to a literal (original) address if the specified argument |
| 951 | is zero. |
| 952 | <p> |
| 953 | <li>Finally, <code>INCEIP</code> advances the simulated |
| 954 | <code>%EIP</code> by the specified literal amount. This supports |
| 955 | lazy <code>%EIP</code> updating, as described below. |
| 956 | </ul> |
| 957 | |
| 958 | <p> |
| 959 | Stages 1 and 2 of the 6-stage translation process mentioned above |
| 960 | deal purely with these uopcodes, and no others. They are |
| 961 | sufficient to express pretty much all the x86 32-bit protected-mode |
| 962 | instruction set, at |
| 963 | least everything understood by a pre-MMX original Pentium (P54C). |
| 964 | |
| 965 | <p> |
| 966 | Stages 3, 4, 5 and 6 also deal with the following extra |
| 967 | "instrumentation" uopcodes. They are used to express all the |
| 968 | definedness-tracking and -checking machinery which valgrind does. In |
| 969 | later sections we show how to create checking code for each of the |
| 970 | uopcodes above. Note that these instrumentation uopcodes, although |
| 971 | some appearing complicated, have been carefully chosen so that |
| 972 | efficient x86 code can be generated for them. GNU superopt v2.5 did a |
| 973 | great job helping out here. Anyways, the uopcodes are as follows: |
| 974 | |
| 975 | <ul> |
| 976 | <li><code>GETV</code> and <code>PUTV</code> are analogues to |
| 977 | <code>GET</code> and <code>PUT</code> above. They are identical |
| 978 | except that they move the V bits for the specified values back and |
| 979 | forth to <code>TempRegs</code>, rather than moving the values |
| 980 | themselves. |
| 981 | <p> |
| 982 | <li>Similarly, <code>LOADV</code> and <code>STOREV</code> read and |
| 983 | write V bits from the synthesised shadow memory that Valgrind |
| 984 | maintains. In fact they do more than that, since they also do |
| 985 | address-validity checks, and emit complaints if the read/written |
| 986 |     addresses are unaddressable.
| 987 | <p> |
| 988 | <li><code>TESTV</code>, whose parameters are a <code>TempReg</code> |
| 989 | and a size, tests the V bits in the <code>TempReg</code>, at the |
| 990 | specified operation size (0/1/2/4 byte) and emits an error if any |
| 991 | of them indicate undefinedness. This is the only uopcode capable |
| 992 | of doing such tests. |
| 993 | <p> |
| 994 | <li><code>SETV</code>, whose parameters are also <code>TempReg</code> |
| 995 |     and a size, makes the V bits in the <code>TempReg</code> indicate
| 996 |     definedness, at the specified operation size.  This is usually
| 997 | used to generate the correct V bits for a literal value, which is |
| 998 | of course fully defined. |
| 999 | <p> |
| 1000 | <li><code>GETVF</code> and <code>PUTVF</code> are analogues to |
| 1001 | <code>GETF</code> and <code>PUTF</code>. They move the single V |
| 1002 | bit used to model definedness of <code>%EFLAGS</code> between its |
| 1003 | home in <code>VG_(baseBlock)</code> and the specified |
| 1004 | <code>TempReg</code>. |
| 1005 | <p> |
| 1006 | <li><code>TAG1</code> denotes one of a family of unary operations on |
| 1007 | <code>TempReg</code>s containing V bits. Similarly, |
| 1008 | <code>TAG2</code> denotes one in a family of binary operations on |
| 1009 | V bits. |
| 1010 | </ul> |
| 1011 | |
| 1012 | <p> |
| 1013 | These 10 uopcodes are sufficient to express Valgrind's entire |
| 1014 | definedness-checking semantics. In fact most of the interesting magic |
| 1015 | is done by the <code>TAG1</code> and <code>TAG2</code> |
| 1016 | suboperations. |
| 1017 | |
| 1018 | <p> |
| 1019 | First, however, I need to explain about V-vector operation sizes. |
| 1020 | There are 4 sizes: 1, 2 and 4, which operate on groups of 8, 16 and 32 |
| 1021 | V bits at a time, supporting the usual 1, 2 and 4 byte x86 operations. |
| 1022 | However there is also the mysterious size 0, which really means a |
| 1023 | single V bit. Single V bits are used in various circumstances; in |
| 1024 | particular, the definedness of <code>%EFLAGS</code> is modelled with a |
| 1025 | single V bit. Now might be a good time to also point out that for |
| 1026 | V bits, 1 means "undefined" and 0 means "defined". Similarly, for A |
| 1027 | bits, 1 means "invalid address" and 0 means "valid address". This |
| 1028 | seems counterintuitive (and so it is), but testing against zero on |
| 1029 | x86s saves instructions compared to testing against all 1s, because |
| 1030 | many ALU operations set the Z flag for free, so to speak. |
| 1031 | |
| 1032 | <p> |
| 1033 | With that in mind, the tag ops are: |
| 1034 | |
| 1035 | <ul> |
| 1036 | <li><b>(UNARY) Pessimising casts</b>: <code>VgT_PCast40</code>, |
| 1037 | <code>VgT_PCast20</code>, <code>VgT_PCast10</code>, |
| 1038 | <code>VgT_PCast01</code>, <code>VgT_PCast02</code> and |
| 1039 | <code>VgT_PCast04</code>. A "pessimising cast" takes a V-bit |
| 1040 | vector at one size, and creates a new one at another size, |
| 1041 | pessimised in the sense that if any of the bits in the source |
| 1042 | vector indicate undefinedness, then all the bits in the result |
| 1043 | indicate undefinedness. In this case the casts are all to or from |
| 1044 | a single V bit, so for example <code>VgT_PCast40</code> is a |
| 1045 | pessimising cast from 32 bits to 1, whereas |
| 1046 | <code>VgT_PCast04</code> simply copies the single source V bit |
| 1047 | into all 32 bit positions in the result. Surprisingly, these ops |
| 1048 | can all be implemented very efficiently. |
| 1049 | <p> |
| 1050 | There are also the pessimising casts <code>VgT_PCast14</code>, |
| 1051 | from 8 bits to 32, <code>VgT_PCast12</code>, from 8 bits to 16, |
| 1052 | and <code>VgT_PCast11</code>, from 8 bits to 8. This last one |
| 1053 | seems nonsensical, but in fact it isn't a no-op because, as |
| 1054 | mentioned above, any undefined (1) bits in the source infect the |
| 1055 | entire result. |
| 1056 | <p> |
| 1057 | <li><b>(UNARY) Propagating undefinedness upwards in a word</b>: |
| 1058 | <code>VgT_Left4</code>, <code>VgT_Left2</code> and |
| 1059 | <code>VgT_Left1</code>. These are used to simulate the worst-case |
| 1060 | effects of carry propagation in adds and subtracts. They return a |
| 1061 | V vector identical to the original, except that if the original |
| 1062 | contained any undefined bits, then it and all bits above it are |
| 1063 | marked as undefined too. Hence the Left bit in the names. |
| 1064 | <p> |
| 1065 | <li><b>(UNARY) Signed and unsigned value widening</b>: |
| 1066 | <code>VgT_SWiden14</code>, <code>VgT_SWiden24</code>, |
| 1067 | <code>VgT_SWiden12</code>, <code>VgT_ZWiden14</code>, |
| 1068 | <code>VgT_ZWiden24</code> and <code>VgT_ZWiden12</code>. These |
| 1069 | mimic the definedness effects of standard signed and unsigned |
| 1070 | integer widening. Unsigned widening creates zero bits in the new |
| 1071 |     positions, so <code>VgT_ZWiden*</code> accordingly mark those
| 1072 |     parts of their argument as defined.  Signed widening copies
| 1073 | the sign bit into the new positions, so <code>VgT_SWiden*</code> |
| 1074 | copies the definedness of the sign bit into the new positions. |
| 1075 | Because 1 means undefined and 0 means defined, these operations |
| 1076 | can (fascinatingly) be done by the same operations which they |
| 1077 | mimic. Go figure. |
| 1078 | <p> |
| 1079 | <li><b>(BINARY) Undefined-if-either-Undefined, |
| 1080 | Defined-if-either-Defined</b>: <code>VgT_UifU4</code>, |
| 1081 | <code>VgT_UifU2</code>, <code>VgT_UifU1</code>, |
| 1082 | <code>VgT_UifU0</code>, <code>VgT_DifD4</code>, |
| 1083 | <code>VgT_DifD2</code>, <code>VgT_DifD1</code>. These do simple |
| 1084 | bitwise operations on pairs of V-bit vectors, with |
| 1085 | <code>UifU</code> giving undefined if either arg bit is |
| 1086 | undefined, and <code>DifD</code> giving defined if either arg bit |
| 1087 | is defined. Abstract interpretation junkies, if any make it this |
| 1088 | far, may like to think of them as meets and joins (or is it joins |
| 1089 | and meets) in the definedness lattices. |
| 1090 | <p> |
| 1091 | <li><b>(BINARY; one value, one V-bit vector) Generate argument improvement
| 1092 | terms for AND and OR</b>: <code>VgT_ImproveAND4_TQ</code>, |
| 1093 | <code>VgT_ImproveAND2_TQ</code>, <code>VgT_ImproveAND1_TQ</code>, |
| 1094 | <code>VgT_ImproveOR4_TQ</code>, <code>VgT_ImproveOR2_TQ</code>, |
| 1095 | <code>VgT_ImproveOR1_TQ</code>. These help out with AND and OR |
| 1096 | operations. AND and OR have the inconvenient property that the |
| 1097 | definedness of the result depends on the actual values of the |
| 1098 | arguments as well as their definedness. At the bit level: |
| 1099 | <br><code>1 AND undefined = undefined</code>, but |
| 1100 | <br><code>0 AND undefined = 0</code>, and similarly |
| 1101 | <br><code>0 OR undefined = undefined</code>, but |
| 1102 | <br><code>1 OR undefined = 1</code>. |
| 1103 | <br> |
| 1104 | <p> |
| 1105 | It turns out that gcc (quite legitimately) generates code which |
| 1106 | relies on this fact, so we have to model it properly in order to |
| 1107 | avoid flooding users with spurious value errors. The ultimate |
| 1108 | definedness result of AND and OR is calculated using |
| 1109 | <code>UifU</code> on the definedness of the arguments, but we |
| 1110 | also <code>DifD</code> in some "improvement" terms which |
| 1111 | take into account the above phenomena. |
| 1112 | <p> |
| 1113 | <code>ImproveAND</code> takes as its first argument the actual |
| 1114 | value of an argument to AND (the T) and the definedness of that |
| 1115 | argument (the Q), and returns a V-bit vector which is defined (0) |
| 1116 | for bits which have value 0 and are defined; this, when |
| 1117 | <code>DifD</code> into the final result causes those bits to be |
| 1118 | defined even if the corresponding bit in the other argument is undefined. |
| 1119 | <p> |
| 1120 | The <code>ImproveOR</code> ops do the dual thing for OR |
| 1121 | arguments. Note that XOR does not have this property that one |
| 1122 | argument can make the other irrelevant, so there is no need for |
| 1123 | such complexity for XOR. |
| 1124 | </ul> |
| 1125 | |
| 1126 | <p> |
| 1127 | That's all the tag ops. If you stare at this long enough, and then |
| 1128 | run Valgrind and stare at the pre- and post-instrumented ucode, it |
| 1129 | should be fairly obvious how the instrumentation machinery hangs |
| 1130 | together. |
| 1131 | |
| 1132 | <p> |
| 1133 | One point, if you do this: in order to make it easy to differentiate |
| 1134 | <code>TempReg</code>s carrying values from <code>TempReg</code>s |
| 1135 | carrying V bit vectors, Valgrind prints the former as (for example) |
| 1136 | <code>t28</code> and the latter as <code>q28</code>; the fact that |
| 1137 | they carry the same number serves to indicate their relationship. |
| 1138 | This is purely for the convenience of the human reader; the register |
| 1139 | allocator and code generator don't regard them as different. |
| 1140 | |
| 1141 | |
| 1142 | <h3>Translation into UCode</h3> |
| 1143 | |
| 1144 | <code>VG_(disBB)</code> allocates a new <code>UCodeBlock</code> and |
| 1145 | then uses <code>disInstr</code> to translate x86 instructions one at a |
| 1146 | time into UCode, dumping the result in the <code>UCodeBlock</code>. |
| 1147 | This goes on until a control-flow transfer instruction is encountered. |
| 1148 | |
| 1149 | <p> |
| 1150 | Despite the large size of <code>vg_to_ucode.c</code>, this translation |
| 1151 | is really very simple. Each x86 instruction is translated entirely |
| 1152 | independently of its neighbours, merrily allocating new |
| 1153 | <code>TempReg</code>s as it goes. The idea is to have a simple |
| 1154 | translator -- in reality, no more than a macro-expander -- and the
| 1155 | resulting bad UCode translation is cleaned up by the UCode
| 1156 | optimisation phase which follows. To give you an idea of some x86 |
| 1157 | instructions and their translations (this is a complete basic block, |
| 1158 | as Valgrind sees it): |
| 1159 | <pre> |
| 1160 | 0x40435A50: incl %edx |
| 1161 | |
| 1162 | 0: GETL %EDX, t0 |
| 1163 | 1: INCL t0 (-wOSZAP) |
| 1164 | 2: PUTL t0, %EDX |
| 1165 | |
| 1166 | 0x40435A51: movsbl (%edx),%eax |
| 1167 | |
| 1168 | 3: GETL %EDX, t2 |
| 1169 | 4: LDB (t2), t2 |
| 1170 | 5: WIDENL_Bs t2 |
| 1171 | 6: PUTL t2, %EAX |
| 1172 | |
| 1173 | 0x40435A54: testb $0x20, 1(%ecx,%eax,2) |
| 1174 | |
| 1175 | 7: GETL %EAX, t6 |
| 1176 | 8: GETL %ECX, t8 |
| 1177 | 9: LEA2L 1(t8,t6,2), t4 |
| 1178 | 10: LDB (t4), t10 |
| 1179 | 11: MOVB $0x20, t12 |
| 1180 | 12: ANDB t12, t10 (-wOSZACP) |
| 1181 | 13: INCEIPo $9 |
| 1182 | |
| 1183 | 0x40435A59: jnz-8 0x40435A50 |
| 1184 | |
| 1185 | 14: Jnzo $0x40435A50 (-rOSZACP) |
| 1186 | 15: JMPo $0x40435A5B |
| 1187 | </pre> |
| 1188 | |
| 1189 | <p> |
| 1190 | Notice how the block always ends with an unconditional jump to the |
| 1191 | next block. This is a bit unnecessary, but makes many things simpler. |
| 1192 | |
| 1193 | <p> |
| 1194 | Most x86 instructions turn into sequences of <code>GET</code>, |
<code>PUT</code>, <code>LEA1</code>, <code>LEA2</code>,
| 1196 | <code>LOAD</code> and <code>STORE</code>. Some complicated ones |
| 1197 | however rely on calling helper bits of code in |
| 1198 | <code>vg_helpers.S</code>. The ucode instructions <code>PUSH</code>, |
| 1199 | <code>POP</code>, <code>CALL</code>, <code>CALLM_S</code> and |
| 1200 | <code>CALLM_E</code> support this. The calling convention is somewhat |
ad-hoc and is not the C calling convention. The helper routines must
save all the integer registers and flags that they use. Args are
passed on the stack underneath the return address, as usual, and any
results are either placed in dummy arg slots created by the ucode
<code>PUSH</code> sequence, or simply overwrite the incoming args.
| 1207 | |
| 1208 | <p> |
| 1209 | In order that the instrumentation mechanism can handle calls to these |
| 1210 | helpers, <code>VG_(saneUCodeBlock)</code> enforces the following |
| 1211 | restrictions on calls to helpers: |
| 1212 | |
| 1213 | <ul> |
| 1214 | <li>Each <code>CALL</code> uinstr must be bracketed by a preceding |
| 1215 | <code>CALLM_S</code> marker (dummy uinstr) and a trailing |
| 1216 | <code>CALLM_E</code> marker. These markers are used by the |
| 1217 | instrumentation mechanism later to establish the boundaries of the |
| 1218 | <code>PUSH</code>, <code>POP</code> and <code>CLEAR</code> |
| 1219 | sequences for the call. |
| 1220 | <p> |
| 1221 | <li><code>PUSH</code>, <code>POP</code> and <code>CLEAR</code> |
| 1222 | may only appear inside sections bracketed by <code>CALLM_S</code> |
| 1223 | and <code>CALLM_E</code>, and nowhere else. |
| 1224 | <p> |
| 1225 | <li>In any such bracketed section, no two <code>PUSH</code> insns may |
push the same <code>TempReg</code>. Dually, no two
<code>POP</code>s may pop the same <code>TempReg</code>.
| 1228 | <p> |
| 1229 | <li>Finally, although this is not checked, args should be removed from |
| 1230 | the stack with <code>CLEAR</code>, rather than <code>POP</code>s |
| 1231 | into a <code>TempReg</code> which is not subsequently used. This |
| 1232 | is because the instrumentation mechanism assumes that all values |
| 1233 | <code>POP</code>ped from the stack are actually used. |
| 1234 | </ul> |
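<p>
A checker enforcing these rules needs nothing more than a linear scan and a little state. The sketch below is hypothetical -- the real checks live in <code>VG_(saneUCodeBlock)</code>, and the mini-uinstr representation here is invented for illustration (it also assumes TempReg numbers below 64 so a bitmap suffices).

```c
#include <stdbool.h>

/* Hypothetical mini-uinstr: just an opcode and a TempReg number. */
enum uop { CALLM_S, CALLM_E, CALL, PUSH, POP, CLEAR, OTHER };
struct uinstr { enum uop op; int tmp; };

/* Scan a block, enforcing: CALL/PUSH/POP/CLEAR only inside a
   CALLM_S .. CALLM_E section, no nesting, and no TempReg pushed
   (or popped) twice within one bracketed section. */
bool sane_block(const struct uinstr *u, int n)
{
   bool in_call = false;
   unsigned long long pushed = 0, popped = 0;  /* bitmaps over tmps */
   for (int i = 0; i < n; i++) {
      switch (u[i].op) {
      case CALLM_S:
         if (in_call) return false;           /* no nesting */
         in_call = true; pushed = popped = 0;
         break;
      case CALLM_E:
         if (!in_call) return false;
         in_call = false;
         break;
      case PUSH:
         if (!in_call || ((pushed >> u[i].tmp) & 1)) return false;
         pushed |= 1ULL << u[i].tmp;
         break;
      case POP:
         if (!in_call || ((popped >> u[i].tmp) & 1)) return false;
         popped |= 1ULL << u[i].tmp;
         break;
      case CLEAR:
      case CALL:
         if (!in_call) return false;
         break;
      default:
         break;
      }
   }
   return !in_call;                           /* section must close */
}

/* Two tiny example blocks for exercising the checker. */
static const struct uinstr good[] =
   { {CALLM_S,0}, {PUSH,1}, {CALL,0}, {CLEAR,0}, {CALLM_E,0} };
static const struct uinstr bad[] =
   { {PUSH,1}, {CALL,0} };   /* PUSH/CALL outside any brackets */
```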
| 1235 | |
| 1236 | Some of the translations may appear to have redundant |
| 1237 | <code>TempReg</code>-to-<code>TempReg</code> moves. This helps the |
| 1238 | next phase, UCode optimisation, to generate better code. |
| 1239 | |
| 1240 | |
| 1241 | |
| 1242 | <h3>UCode optimisation</h3> |
| 1243 | |
| 1244 | UCode is then subjected to an improvement pass |
| 1245 | (<code>vg_improve()</code>), which blurs the boundaries between the |
| 1246 | translations of the original x86 instructions. It's pretty |
| 1247 | straightforward. Three transformations are done: |
| 1248 | |
| 1249 | <ul> |
| 1250 | <li>Redundant <code>GET</code> elimination. Actually, more general |
| 1251 | than that -- eliminates redundant fetches of ArchRegs. In our |
| 1252 | running example, uinstr 3 <code>GET</code>s <code>%EDX</code> into |
| 1253 | <code>t2</code> despite the fact that, by looking at the previous |
| 1254 | uinstr, it is already in <code>t0</code>. The <code>GET</code> is |
| 1255 | therefore removed, and <code>t2</code> renamed to <code>t0</code>. |
| 1256 | Assuming <code>t0</code> is allocated to a host register, it means |
| 1257 | the simulated <code>%EDX</code> will exist in a host CPU register |
| 1258 | for more than one simulated x86 instruction, which seems to me to |
| 1259 | be a highly desirable property. |
| 1260 | <p> |
| 1261 | There is some mucking around to do with subregisters; |
<code>%AL</code> vs <code>%AH</code>, <code>%AX</code> vs
<code>%EAX</code>, etc. I can't remember how it works, but in
| 1264 | general we are very conservative, and these tend to invalidate the |
| 1265 | caching. |
| 1266 | <p> |
| 1267 | <li>Redundant <code>PUT</code> elimination. This annuls |
| 1268 | <code>PUT</code>s of values back to simulated CPU registers if a |
| 1269 | later <code>PUT</code> would overwrite the earlier |
<code>PUT</code> value, and there are no intervening reads of the
| 1271 | simulated register (<code>ArchReg</code>). |
| 1272 | <p> |
| 1273 | As before, we are paranoid when faced with subregister references. |
| 1274 | Also, <code>PUT</code>s of <code>%ESP</code> are never annulled, |
because it is vital that the instrumenter always has an up-to-date
<code>%ESP</code> value available, since <code>%ESP</code> changes
affect the addressability of the memory around the simulated stack
pointer.
| 1279 | <p> |
| 1280 | The implication of the above paragraph is that the simulated |
| 1281 | machine's registers are only lazily updated once the above two |
| 1282 | optimisation phases have run, with the exception of |
| 1283 | <code>%ESP</code>. <code>TempReg</code>s go dead at the end of |
every basic block, from which it is inferable that any
| 1285 | <code>TempReg</code> caching a simulated CPU reg is flushed (back |
| 1286 | into the relevant <code>VG_(baseBlock)</code> slot) at the end of |
| 1287 | every basic block. The further implication is that the simulated |
registers are only up to date in between basic blocks, and not
| 1289 | at arbitrary points inside basic blocks. And the consequence of |
| 1290 | that is that we can only deliver signals to the client in between |
| 1291 | basic blocks. None of this seems any problem in practice. |
| 1292 | <p> |
| 1293 | <li>Finally there is a simple def-use thing for condition codes. If |
an earlier uinstr writes the condition codes, and the next uinstr
along that touches the condition codes writes the same or a larger
set of them, but does not read any, then the earlier uinstr is marked
as not writing any condition codes. This saves
| 1298 | a lot of redundant cond-code saving and restoring. |
| 1299 | </ul> |
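<p>
The first of these transformations can be sketched as a single forward pass that remembers which <code>TempReg</code>, if any, currently caches each <code>ArchReg</code>. The representation below is invented for illustration (the real pass in <code>vg_translate.c</code> works on full uinstrs and is, as noted, far more conservative about subregisters, which this sketch ignores entirely):

```c
/* Hypothetical simplified uinstrs: GET arch->tmp, PUT tmp->arch,
   USE tmp (any other read-modify-write of a TempReg).  NOP marks a
   deleted uinstr. */
enum op { GET, PUT, USE, NOP };
struct ui { enum op op; int arch; int tmp; };

#define NARCH  8
#define NTEMP 64

/* Redundant-GET elimination: if an ArchReg is already cached in a
   TempReg, delete the GET and rename its target TempReg to the
   caching one.  A write to a TempReg invalidates any cache entry
   pointing at it. */
void elim_redundant_gets(struct ui *u, int n)
{
   int cache[NARCH], ren[NTEMP];
   for (int i = 0; i < NARCH; i++) cache[i] = -1;
   for (int i = 0; i < NTEMP; i++) ren[i] = i;
   for (int i = 0; i < n; i++) {
      switch (u[i].op) {
      case GET:
         if (cache[u[i].arch] != -1) {
            ren[u[i].tmp] = cache[u[i].arch];   /* rename ... */
            u[i].op = NOP;                      /* ... and delete */
         } else {
            cache[u[i].arch] = u[i].tmp;
         }
         break;
      case PUT:
         u[i].tmp = ren[u[i].tmp];
         cache[u[i].arch] = u[i].tmp;           /* re-establish cache */
         break;
      case USE:
         u[i].tmp = ren[u[i].tmp];
         for (int a = 0; a < NARCH; a++)        /* tmp was modified */
            if (cache[a] == u[i].tmp) cache[a] = -1;
         break;
      default:
         break;
      }
   }
}

/* The start of the running example: GETL %EDX,t0; INCL t0;
   PUTL t0,%EDX; GETL %EDX,t2 (redundant); LDB (t2),t2.
   Here arch 2 stands in for %EDX. */
static struct ui eg[] = {
   {GET, 2, 0}, {USE, 0, 0}, {PUT, 2, 0}, {GET, 2, 2}, {USE, 0, 2}
};

int check_example(void)
{
   elim_redundant_gets(eg, 5);
   return eg[3].op == NOP && eg[4].tmp == 0;   /* GET gone, t2 -> t0 */
}
```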
| 1300 | |
| 1301 | The effect of these transformations on our short block is rather |
| 1302 | unexciting, and shown below. On longer basic blocks they can |
| 1303 | dramatically improve code quality. |
| 1304 | |
| 1305 | <pre> |
| 1306 | at 3: delete GET, rename t2 to t0 in (4 .. 6) |
| 1307 | at 7: delete GET, rename t6 to t0 in (8 .. 9) |
| 1308 | at 1: annul flag write OSZAP due to later OSZACP |
| 1309 | |
| 1310 | Improved code: |
| 1311 | 0: GETL %EDX, t0 |
| 1312 | 1: INCL t0 |
| 1313 | 2: PUTL t0, %EDX |
| 1314 | 4: LDB (t0), t0 |
| 1315 | 5: WIDENL_Bs t0 |
| 1316 | 6: PUTL t0, %EAX |
| 1317 | 8: GETL %ECX, t8 |
| 1318 | 9: LEA2L 1(t8,t0,2), t4 |
| 1319 | 10: LDB (t4), t10 |
| 1320 | 11: MOVB $0x20, t12 |
| 1321 | 12: ANDB t12, t10 (-wOSZACP) |
| 1322 | 13: INCEIPo $9 |
| 1323 | 14: Jnzo $0x40435A50 (-rOSZACP) |
| 1324 | 15: JMPo $0x40435A5B |
| 1325 | </pre> |
| 1326 | |
| 1327 | <h3>UCode instrumentation</h3> |
| 1328 | |
| 1329 | Once you understand the meaning of the instrumentation uinstrs, |
| 1330 | discussed in detail above, the instrumentation scheme is fairly |
straightforward. Each uinstr is instrumented in isolation, and the
| 1332 | instrumentation uinstrs are placed before the original uinstr. |
| 1333 | Our running example continues below. I have placed a blank line |
| 1334 | after every original ucode, to make it easier to see which |
| 1335 | instrumentation uinstrs correspond to which originals. |
| 1336 | |
| 1337 | <p> |
| 1338 | As mentioned somewhere above, <code>TempReg</code>s carrying values |
| 1339 | have names like <code>t28</code>, and each one has a shadow carrying |
| 1340 | its V bits, with names like <code>q28</code>. This pairing aids in |
| 1341 | reading instrumented ucode. |
| 1342 | |
| 1343 | <p> |
| 1344 | One decision about all this is where to have "observation points", |
| 1345 | that is, where to check that V bits are valid. I use a minimalistic |
| 1346 | scheme, only checking where a failure of validity could cause the |
| 1347 | original program to (seg)fault. So the use of values as memory |
| 1348 | addresses causes a check, as do conditional jumps (these cause a check |
| 1349 | on the definedness of the condition codes). And arguments |
<code>PUSH</code>ed for helper calls are checked, hence the weird
restrictions on helper call preambles described above.
| 1352 | |
| 1353 | <p> |
| 1354 | Another decision is that once a value is tested, it is thereafter |
| 1355 | regarded as defined, so that we do not emit multiple undefined-value |
| 1356 | errors for the same undefined value. That means that |
| 1357 | <code>TESTV</code> uinstrs are always followed by <code>SETV</code> |
| 1358 | on the same (shadow) <code>TempReg</code>s. Most of these |
| 1359 | <code>SETV</code>s are redundant and are removed by the |
| 1360 | post-instrumentation cleanup phase. |
| 1361 | |
| 1362 | <p> |
| 1363 | The instrumentation for calling helper functions deserves further |
| 1364 | comment. The definedness of results from a helper is modelled using |
| 1365 | just one V bit. So, in short, we do pessimising casts of the |
| 1366 | definedness of all the args, down to a single bit, and then |
| 1367 | <code>UifU</code> these bits together. So this single V bit will say |
| 1368 | "undefined" if any part of any arg is undefined. This V bit is then |
pessimally cast back up to the result(s) sizes, as needed. If
Valgrind observes that all the args are got rid of with <code>CLEAR</code> and
none with <code>POP</code>, it concludes that the result of the call
is not actually used, and immediately examines the result V bit with a
| 1373 | <code>TESTV</code> -- <code>SETV</code> pair. If it did not do this, |
there would be no observation point to detect that some of the
| 1375 | args to the helper were undefined. Of course, if the helper's results |
| 1376 | are indeed used, we don't do this, since the result usage will |
| 1377 | presumably cause the result definedness to be checked at some suitable |
| 1378 | future point. |
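<p>
In this one-V-bit model, the definedness of a helper call's result reduces to a composition of the tag ops already covered: pessimise each arg's V word down to a single bit, <code>UifU</code> (OR) the bits together, and pessimise the result back up. A hedged sketch, again assuming V bit 1 = undefined:

```c
#include <stdint.h>

/* Collapse a 32-bit V word to one bit: undefined if any bit is. */
static int pcast_to_1(uint32_t v) { return v != 0; }

/* Expand one V bit back to a 32-bit V word. */
static uint32_t pcast_to_4(int vbit) { return vbit ? 0xFFFFFFFFu : 0; }

/* Single-V-bit result for a helper call: pessimising casts of all
   the args' V words, UifU'd (OR'd) together, then cast back up. */
uint32_t helper_result_vbits(const uint32_t *arg_vbits, int nargs)
{
   int vbit = 0;
   for (int i = 0; i < nargs; i++)
      vbit |= pcast_to_1(arg_vbits[i]);
   return pcast_to_4(vbit);
}

/* Two example argument sets for exercising the rule. */
static const uint32_t vd[2] = { 0, 0 };          /* both fully defined */
static const uint32_t vu[2] = { 0, 0x00000100 }; /* one partly undefined */
```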
| 1379 | |
| 1380 | <p> |
| 1381 | In general Valgrind tries to track definedness on a bit-for-bit basis, |
| 1382 | but as the above para shows, for calls to helpers we throw in the |
| 1383 | towel and approximate down to a single bit. This is because it's too |
| 1384 | complex and difficult to track bit-level definedness through complex |
ops such as integer multiply and divide, and in any case there are no
reasonable code fragments which attempt to (eg) multiply two
| 1387 | partially-defined values and end up with something meaningful, so |
| 1388 | there seems little point in modelling multiplies, divides, etc, in |
| 1389 | that level of detail. |
| 1390 | |
| 1391 | <p> |
| 1392 | Integer loads and stores are instrumented with firstly a test of the |
| 1393 | definedness of the address, followed by a <code>LOADV</code> or |
| 1394 | <code>STOREV</code> respectively. These turn into calls to |
| 1395 | (for example) <code>VG_(helperc_LOADV4)</code>. These helpers do two |
| 1396 | things: they perform an address-valid check, and they load or store V |
| 1397 | bits from/to the relevant address in the (simulated V-bit) memory. |
| 1398 | |
| 1399 | <p> |
FPU loads and stores are different. As above, the definedness of the
| 1401 | address is first tested. However, the helper routine for FPU loads |
| 1402 | (<code>VGM_(fpu_read_check)</code>) emits an error if either the |
| 1403 | address is invalid or the referenced area contains undefined values. |
It has to do this because we do not simulate the FPU at all, and so
cannot track the definedness of values loaded into it from memory;
hence we have to check them as soon as they are loaded into the FPU, ie, at
| 1407 | this point. We notionally assume that everything in the FPU is |
| 1408 | defined. |
| 1409 | |
| 1410 | <p> |
| 1411 | It follows therefore that FPU writes first check the definedness of |
| 1412 | the address, then the validity of the address, and finally mark the |
| 1413 | written bytes as well-defined. |
| 1414 | |
| 1415 | <p> |
| 1416 | If anyone is inspired to extend Valgrind to MMX/SSE insns, I suggest |
you use the same trick. It works provided that the FPU/MMX unit is
not merely used as a conduit to copy partially undefined data from
| 1419 | one place in memory to another. Unfortunately the integer CPU is used |
| 1420 | like that (when copying C structs with holes, for example) and this is |
| 1421 | the cause of much of the elaborateness of the instrumentation here |
| 1422 | described. |
| 1423 | |
| 1424 | <p> |
| 1425 | <code>vg_instrument()</code> in <code>vg_translate.c</code> actually |
| 1426 | does the instrumentation. There are comments explaining how each |
| 1427 | uinstr is handled, so we do not repeat that here. As explained |
| 1428 | already, it is bit-accurate, except for calls to helper functions. |
| 1429 | Unfortunately the x86 insns <code>bt/bts/btc/btr</code> are done by |
| 1430 | helper fns, so bit-level accuracy is lost there. This should be fixed |
by doing them inline; it will probably require adding a couple of new
| 1432 | uinstrs. Also, left and right rotates through the carry flag (x86 |
| 1433 | <code>rcl</code> and <code>rcr</code>) are approximated via a single |
| 1434 | V bit; so far this has not caused anyone to complain. The |
| 1435 | non-carry rotates, <code>rol</code> and <code>ror</code>, are much |
| 1436 | more common and are done exactly. Re-visiting the instrumentation for |
| 1437 | AND and OR, they seem rather verbose, and I wonder if it could be done |
| 1438 | more concisely now. |
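<p>
The AND rule visible in the listing above (uinstrs 27 to 33) can in fact be summarised in a few lines. This is a sketch of what <code>ImproveAND_TQ</code>, <code>UifU</code> and <code>DifD</code> compose to, assuming V bit 1 = undefined: an output bit of AND is defined if both input bits are defined, or if either input bit is a defined zero, since a defined zero forces the output bit to zero regardless of the other operand.

```c
#include <stdint.h>

/* V bits for (d1 & d2), given data words d1,d2 and their V words
   v1,v2 (bit == 1 means undefined).  ImproveAND_TQ(d,v) behaves
   like d | v: it is 0 ("may be treated as defined") exactly where
   the data bit is a defined zero. */
uint32_t and_vbits(uint32_t d1, uint32_t v1, uint32_t d2, uint32_t v2)
{
   uint32_t naive    = v1 | v2;   /* UifU: undefined if either is   */
   uint32_t improve1 = d1 | v1;   /* defined-zero bits of arg1      */
   uint32_t improve2 = d2 | v2;   /* defined-zero bits of arg2      */
   return naive & improve1 & improve2;   /* two DifD applications   */
}
```

So, for example, ANDing a wholly undefined byte with a defined zero yields a defined result, exactly the refinement the <code>ImproveAND</code> term buys over the naive <code>UifU</code>.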
| 1439 | |
| 1440 | <p> |
| 1441 | The lowercase <code>o</code> on many of the uopcodes in the running |
| 1442 | example indicates that the size field is zero, usually meaning a |
| 1443 | single-bit operation. |
| 1444 | |
| 1445 | <p> |
| 1446 | Anyroads, the post-instrumented version of our running example looks |
| 1447 | like this: |
| 1448 | |
| 1449 | <pre> |
| 1450 | Instrumented code: |
| 1451 | 0: GETVL %EDX, q0 |
| 1452 | 1: GETL %EDX, t0 |
| 1453 | |
| 1454 | 2: TAG1o q0 = Left4 ( q0 ) |
| 1455 | 3: INCL t0 |
| 1456 | |
| 1457 | 4: PUTVL q0, %EDX |
| 1458 | 5: PUTL t0, %EDX |
| 1459 | |
| 1460 | 6: TESTVL q0 |
| 1461 | 7: SETVL q0 |
| 1462 | 8: LOADVB (t0), q0 |
| 1463 | 9: LDB (t0), t0 |
| 1464 | |
| 1465 | 10: TAG1o q0 = SWiden14 ( q0 ) |
| 1466 | 11: WIDENL_Bs t0 |
| 1467 | |
| 1468 | 12: PUTVL q0, %EAX |
| 1469 | 13: PUTL t0, %EAX |
| 1470 | |
| 1471 | 14: GETVL %ECX, q8 |
| 1472 | 15: GETL %ECX, t8 |
| 1473 | |
| 1474 | 16: MOVL q0, q4 |
| 1475 | 17: SHLL $0x1, q4 |
| 1476 | 18: TAG2o q4 = UifU4 ( q8, q4 ) |
| 1477 | 19: TAG1o q4 = Left4 ( q4 ) |
| 1478 | 20: LEA2L 1(t8,t0,2), t4 |
| 1479 | |
| 1480 | 21: TESTVL q4 |
| 1481 | 22: SETVL q4 |
| 1482 | 23: LOADVB (t4), q10 |
| 1483 | 24: LDB (t4), t10 |
| 1484 | |
| 1485 | 25: SETVB q12 |
| 1486 | 26: MOVB $0x20, t12 |
| 1487 | |
| 1488 | 27: MOVL q10, q14 |
| 1489 | 28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 ) |
| 1490 | 29: TAG2o q10 = UifU1 ( q12, q10 ) |
| 1491 | 30: TAG2o q10 = DifD1 ( q14, q10 ) |
| 1492 | 31: MOVL q12, q14 |
| 1493 | 32: TAG2o q14 = ImproveAND1_TQ ( t12, q14 ) |
| 1494 | 33: TAG2o q10 = DifD1 ( q14, q10 ) |
| 1495 | 34: MOVL q10, q16 |
| 1496 | 35: TAG1o q16 = PCast10 ( q16 ) |
| 1497 | 36: PUTVFo q16 |
| 1498 | 37: ANDB t12, t10 (-wOSZACP) |
| 1499 | |
| 1500 | 38: INCEIPo $9 |
| 1501 | |
| 1502 | 39: GETVFo q18 |
| 1503 | 40: TESTVo q18 |
| 1504 | 41: SETVo q18 |
| 1505 | 42: Jnzo $0x40435A50 (-rOSZACP) |
| 1506 | |
| 1507 | 43: JMPo $0x40435A5B |
| 1508 | </pre> |
| 1509 | |
| 1510 | |
| 1511 | <h3>UCode post-instrumentation cleanup</h3> |
| 1512 | |
| 1513 | <p> |
| 1514 | This pass, coordinated by <code>vg_cleanup()</code>, removes redundant |
| 1515 | definedness computation created by the simplistic instrumentation |
| 1516 | pass. It consists of two passes, |
| 1517 | <code>vg_propagate_definedness()</code> followed by |
| 1518 | <code>vg_delete_redundant_SETVs</code>. |
| 1519 | |
| 1520 | <p> |
| 1521 | <code>vg_propagate_definedness()</code> is a simple |
| 1522 | constant-propagation and constant-folding pass. It tries to determine |
| 1523 | which <code>TempReg</code>s containing V bits will always indicate |
| 1524 | "fully defined", and it propagates this information as far as it can, |
| 1525 | and folds out as many operations as possible. For example, the |
| 1526 | instrumentation for an ADD of a literal to a variable quantity will be |
| 1527 | reduced down so that the definedness of the result is simply the |
| 1528 | definedness of the variable quantity, since the literal is by |
| 1529 | definition fully defined. |
| 1530 | |
| 1531 | <p> |
| 1532 | <code>vg_delete_redundant_SETVs</code> removes <code>SETV</code>s on |
| 1533 | shadow <code>TempReg</code>s for which the next action is a write. |
| 1534 | I don't think there's anything else worth saying about this; it is |
| 1535 | simple. Read the sources for details. |
| 1536 | |
| 1537 | <p> |
| 1538 | So the cleaned-up running example looks like this. As above, I have |
| 1539 | inserted line breaks after every original (non-instrumentation) uinstr |
| 1540 | to aid readability. As with straightforward ucode optimisation, the |
| 1541 | results in this block are undramatic because it is so short; longer |
| 1542 | blocks benefit more because they have more redundancy which gets |
| 1543 | eliminated. |
| 1544 | |
| 1545 | |
| 1546 | <pre> |
| 1547 | at 29: delete UifU1 due to defd arg1 |
| 1548 | at 32: change ImproveAND1_TQ to MOV due to defd arg2 |
| 1549 | at 41: delete SETV |
| 1550 | at 31: delete MOV |
| 1551 | at 25: delete SETV |
| 1552 | at 22: delete SETV |
| 1553 | at 7: delete SETV |
| 1554 | |
| 1555 | 0: GETVL %EDX, q0 |
| 1556 | 1: GETL %EDX, t0 |
| 1557 | |
| 1558 | 2: TAG1o q0 = Left4 ( q0 ) |
| 1559 | 3: INCL t0 |
| 1560 | |
| 1561 | 4: PUTVL q0, %EDX |
| 1562 | 5: PUTL t0, %EDX |
| 1563 | |
| 1564 | 6: TESTVL q0 |
| 1565 | 8: LOADVB (t0), q0 |
| 1566 | 9: LDB (t0), t0 |
| 1567 | |
| 1568 | 10: TAG1o q0 = SWiden14 ( q0 ) |
| 1569 | 11: WIDENL_Bs t0 |
| 1570 | |
| 1571 | 12: PUTVL q0, %EAX |
| 1572 | 13: PUTL t0, %EAX |
| 1573 | |
| 1574 | 14: GETVL %ECX, q8 |
| 1575 | 15: GETL %ECX, t8 |
| 1576 | |
| 1577 | 16: MOVL q0, q4 |
| 1578 | 17: SHLL $0x1, q4 |
| 1579 | 18: TAG2o q4 = UifU4 ( q8, q4 ) |
| 1580 | 19: TAG1o q4 = Left4 ( q4 ) |
| 1581 | 20: LEA2L 1(t8,t0,2), t4 |
| 1582 | |
| 1583 | 21: TESTVL q4 |
| 1584 | 23: LOADVB (t4), q10 |
| 1585 | 24: LDB (t4), t10 |
| 1586 | |
| 1587 | 26: MOVB $0x20, t12 |
| 1588 | |
| 1589 | 27: MOVL q10, q14 |
| 1590 | 28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 ) |
| 1591 | 30: TAG2o q10 = DifD1 ( q14, q10 ) |
| 1592 | 32: MOVL t12, q14 |
| 1593 | 33: TAG2o q10 = DifD1 ( q14, q10 ) |
| 1594 | 34: MOVL q10, q16 |
| 1595 | 35: TAG1o q16 = PCast10 ( q16 ) |
| 1596 | 36: PUTVFo q16 |
| 1597 | 37: ANDB t12, t10 (-wOSZACP) |
| 1598 | |
| 1599 | 38: INCEIPo $9 |
| 1600 | 39: GETVFo q18 |
| 1601 | 40: TESTVo q18 |
| 1602 | 42: Jnzo $0x40435A50 (-rOSZACP) |
| 1603 | |
| 1604 | 43: JMPo $0x40435A5B |
| 1605 | </pre> |
| 1606 | |
| 1607 | |
| 1608 | <h3>Translation from UCode</h3> |
| 1609 | |
| 1610 | This is all very simple, even though <code>vg_from_ucode.c</code> |
| 1611 | is a big file. Position-independent x86 code is generated into |
| 1612 | a dynamically allocated array <code>emitted_code</code>; this is |
| 1613 | doubled in size when it overflows. Eventually the array is handed |
| 1614 | back to the caller of <code>VG_(translate)</code>, who must copy |
| 1615 | the result into TC and TT, and free the array. |
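<p>
The doubling buffer is the classic amortised-growth idiom. A hedged sketch, with invented names rather than the actual <code>vg_from_ucode.c</code> interfaces:

```c
#include <stdlib.h>

/* A growable byte buffer for emitted code, doubling on overflow. */
struct codebuf {
   unsigned char *code;
   int used, size;
};

void cb_init(struct codebuf *cb)
{
   cb->size = 64;
   cb->used = 0;
   cb->code = malloc(cb->size);
}

/* Emit one byte, doubling the array when it fills up. */
void cb_emitB(struct codebuf *cb, unsigned char b)
{
   if (cb->used == cb->size) {
      cb->size *= 2;
      cb->code = realloc(cb->code, cb->size);
   }
   cb->code[cb->used++] = b;
}

/* Exercise the buffer: emit 100 NOPs into a 64-byte buffer and
   confirm it doubled once to 128 bytes. */
int cb_demo(void)
{
   struct codebuf cb;
   cb_init(&cb);
   for (int i = 0; i < 100; i++) cb_emitB(&cb, 0x90);
   int ok = (cb.size == 128) && (cb.used == 100) && (cb.code[99] == 0x90);
   free(cb.code);
   return ok;
}
```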
| 1616 | |
| 1617 | <p> |
| 1618 | This file is structured into four layers of abstraction, which, |
| 1619 | thankfully, are glued back together with extensive |
| 1620 | <code>__inline__</code> directives. From the bottom upwards: |
| 1621 | |
| 1622 | <ul> |
| 1623 | <li>Address-mode emitters, <code>emit_amode_regmem_reg</code> et al. |
| 1624 | <p> |
| 1625 | <li>Emitters for specific x86 instructions. There are quite a lot of |
| 1626 | these, with names such as <code>emit_movv_offregmem_reg</code>. |
| 1627 | The <code>v</code> suffix is Intel parlance for a 16/32 bit insn; |
| 1628 | there are also <code>b</code> suffixes for 8 bit insns. |
| 1629 | <p> |
| 1630 | <li>The next level up are the <code>synth_*</code> functions, which |
| 1631 | synthesise possibly a sequence of raw x86 instructions to do some |
| 1632 | simple task. Some of these are quite complex because they have to |
| 1633 | work around Intel's silly restrictions on subregister naming. See |
| 1634 | <code>synth_nonshiftop_reg_reg</code> for example. |
| 1635 | <p> |
| 1636 | <li>Finally, at the top of the heap, we have |
| 1637 | <code>emitUInstr()</code>, |
| 1638 | which emits code for a single uinstr. |
| 1639 | </ul> |
| 1640 | |
| 1641 | <p> |
| 1642 | Some comments: |
| 1643 | <ul> |
| 1644 | <li>The hack for FPU instructions becomes apparent here. To do a |
| 1645 | <code>FPU</code> ucode instruction, we load the simulated FPU's |
state from its <code>VG_(baseBlock)</code> slot into the real FPU
| 1647 | using an x86 <code>frstor</code> insn, do the ucode |
| 1648 | <code>FPU</code> insn on the real CPU, and write the updated FPU |
| 1649 | state back into <code>VG_(baseBlock)</code> using an |
| 1650 | <code>fnsave</code> instruction. This is pretty brutal, but is |
| 1651 | simple and it works, and even seems tolerably efficient. There is |
| 1652 | no attempt to cache the simulated FPU state in the real FPU over |
| 1653 | multiple back-to-back ucode FPU instructions. |
| 1654 | <p> |
| 1655 | <code>FPU_R</code> and <code>FPU_W</code> are also done this way, |
| 1656 | with the minor complication that we need to patch in some |
| 1657 | addressing mode bits so the resulting insn knows the effective |
| 1658 | address to use. This is easy because of the regularity of the x86 |
| 1659 | FPU instruction encodings. |
| 1660 | <p> |
| 1661 | <li>An analogous trick is done with ucode insns which claim, in their |
| 1662 | <code>flags_r</code> and <code>flags_w</code> fields, that they |
| 1663 | read or write the simulated <code>%EFLAGS</code>. For such cases |
| 1664 | we first copy the simulated <code>%EFLAGS</code> into the real |
| 1665 | <code>%eflags</code>, then do the insn, then, if the insn says it |
| 1666 | writes the flags, copy back to <code>%EFLAGS</code>. This is a |
| 1667 | bit expensive, which is why the ucode optimisation pass goes to |
| 1668 | some effort to remove redundant flag-update annotations. |
| 1669 | </ul> |
| 1670 | |
| 1671 | <p> |
And so ... that's the end of the documentation for the instrumenting
| 1673 | translator! It's really not that complex, because it's composed as a |
| 1674 | sequence of simple(ish) self-contained transformations on |
| 1675 | straight-line blocks of code. |
| 1676 | |
| 1677 | |
| 1678 | <h3>Top-level dispatch loop</h3> |
| 1679 | |
| 1680 | Urk. In <code>VG_(toploop)</code>. This is basically boring and |
| 1681 | unsurprising, not to mention fiddly and fragile. It needs to be |
| 1682 | cleaned up. |
| 1683 | |
| 1684 | <p> |
Perhaps the only surprise is that the whole thing is run
| 1686 | on top of a <code>setjmp</code>-installed exception handler, because, |
| 1687 | supposing a translation got a segfault, we have to bail out of the |
| 1688 | Valgrind-supplied exception handler <code>VG_(oursignalhandler)</code> |
| 1689 | and immediately start running the client's segfault handler, if it has |
| 1690 | one. In particular we can't finish the current basic block and then |
| 1691 | deliver the signal at some convenient future point, because signals |
| 1692 | like SIGILL, SIGSEGV and SIGBUS mean that the faulting insn should not |
| 1693 | simply be re-tried. (I'm sure there is a clearer way to explain this). |
| 1694 | |
| 1695 | |
| 1696 | <h3>Exceptions, creating new translations</h3> |
| 1697 | <h3>Self-modifying code</h3> |
| 1698 | |
| 1699 | <h3>Lazy updates of the simulated program counter</h3> |
| 1700 | |
| 1701 | Simulated <code>%EIP</code> is not updated after every simulated x86 |
| 1702 | insn as this was regarded as too expensive. Instead ucode |
| 1703 | <code>INCEIP</code> insns move it along as and when necessary. |
| 1704 | Currently we don't allow it to fall more than 4 bytes behind reality |
| 1705 | (see <code>VG_(disBB)</code> for the way this works). |
| 1706 | <p> |
| 1707 | Note that <code>%EIP</code> is always brought up to date by the inner |
| 1708 | dispatch loop in <code>VG_(dispatch)</code>, so that if the client |
| 1709 | takes a fault we know at least which basic block this happened in. |
| 1710 | |
| 1711 | |
| 1712 | <h3>The translation cache and translation table</h3> |
| 1713 | |
| 1714 | <h3>Signals</h3> |
| 1715 | |
| 1716 | Horrible, horrible. <code>vg_signals.c</code>. |
| 1717 | Basically, since we have to intercept all system |
| 1718 | calls anyway, we can see when the client tries to install a signal |
| 1719 | handler. If it does so, we make a note of what the client asked to |
| 1720 | happen, and ask the kernel to route the signal to our own signal |
| 1721 | handler, <code>VG_(oursignalhandler)</code>. This simply notes the |
| 1722 | delivery of signals, and returns. |
| 1723 | |
| 1724 | <p> |
| 1725 | Every 1000 basic blocks, we see if more signals have arrived. If so, |
| 1726 | <code>VG_(deliver_signals)</code> builds signal delivery frames on the |
| 1727 | client's stack, and allows their handlers to be run. Valgrind places |
| 1728 | in these signal delivery frames a bogus return address, |
<code>VG_(signalreturn_bogusRA)</code>, and checks all jumps to see
if any jump to it. Such a jump is a sign that a signal handler is
returning, in which case Valgrind removes the relevant signal frame from
the client's stack, restores from the signal frame the simulated
state as it was before the signal was delivered, and allows the client to run
| 1734 | onwards. We have to do it this way because some signal handlers never |
| 1735 | return, they just <code>longjmp()</code>, which nukes the signal |
| 1736 | delivery frame. |
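<p>
The jump-target check itself is trivial; a hedged sketch of the idea, with the sentinel value and names purely hypothetical:

```c
#include <stdbool.h>

typedef unsigned int Addr;

/* The magic address planted as the return address in every signal
   delivery frame.  In Valgrind proper this is the address of
   VG_(signalreturn_bogusRA); here it is just a sentinel. */
static const Addr SIGRETURN_BOGUS_RA = 0xDEADBEEFu;

/* Called on every control transfer out of a basic block: a jump to
   the sentinel means "a signal handler just returned", and the
   dispatcher must pop the signal frame rather than translate the
   target address. */
bool is_signal_return(Addr jump_target)
{
   return jump_target == SIGRETURN_BOGUS_RA;
}
```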
| 1737 | |
| 1738 | <p> |
| 1739 | The Linux kernel has a different but equally horrible hack for |
| 1740 | detecting signal handler returns. Discovering it is left as an |
| 1741 | exercise for the reader. |
| 1742 | |
| 1743 | |
| 1744 | |
| 1745 | <h3>Errors, error contexts, error reporting, suppressions</h3> |
| 1746 | <h3>Client malloc/free</h3> |
| 1747 | <h3>Low-level memory management</h3> |
| 1748 | <h3>A and V bitmaps</h3> |
| 1749 | <h3>Symbol table management</h3> |
| 1750 | <h3>Dealing with system calls</h3> |
| 1751 | <h3>Namespace management</h3> |
| 1752 | <h3>GDB attaching</h3> |
| 1753 | <h3>Non-dependence on glibc or anything else</h3> |
| 1754 | <h3>The leak detector</h3> |
| 1755 | <h3>Performance problems</h3> |
| 1756 | <h3>Continuous sanity checking</h3> |
| 1757 | <h3>Tracing, or not tracing, child processes</h3> |
| 1758 | <h3>Assembly glue for syscalls</h3> |
| 1759 | |
| 1760 | |
| 1761 | <hr width="100%"> |
| 1762 | |
| 1763 | <h2>Extensions</h2> |
| 1764 | |
| 1765 | Some comments about Stuff To Do. |
| 1766 | |
| 1767 | <h3>Bugs</h3> |
| 1768 | |
| 1769 | Stephan Kulow and Marc Mutz report problems with kmail in KDE 3 CVS |
| 1770 | (RC2 ish) when run on Valgrind. Stephan has it deadlocking; Marc has |
| 1771 | it looping at startup. I can't repro either behaviour. Needs |
| 1772 | repro-ing and fixing. |
| 1773 | |
| 1774 | |
| 1775 | <h3>Threads</h3> |
| 1776 | |
| 1777 | Doing a good job of thread support strikes me as almost a |
| 1778 | research-level problem. The central issues are how to do fast cheap |
| 1779 | locking of the <code>VG_(primary_map)</code> structure, whether or not |
| 1780 | accesses to the individual secondary maps need locking, what |
| 1781 | race-condition issues result, and whether the already-nasty mess that |
| 1782 | is the signal simulator needs further hackery. |
| 1783 | |
| 1784 | <p> |
| 1785 | I realise that threads are the most-frequently-requested feature, and |
| 1786 | I am thinking about it all. If you have guru-level understanding of |
| 1787 | fast mutual exclusion mechanisms and race conditions, I would be |
| 1788 | interested in hearing from you. |
| 1789 | |
| 1790 | |
| 1791 | <h3>Verification suite</h3> |
| 1792 | |
| 1793 | Directory <code>tests/</code> contains various ad-hoc tests for |
| 1794 | Valgrind. However, there is no systematic verification or regression |
suite that, for example, exercises all the stuff in
| 1796 | <code>vg_memory.c</code>, to ensure that illegal memory accesses and |
| 1797 | undefined value uses are detected as they should be. It would be good |
| 1798 | to have such a suite. |
| 1799 | |
| 1800 | |
| 1801 | <h3>Porting to other platforms</h3> |
| 1802 | |
It would be great if Valgrind were ported to FreeBSD and x86 NetBSD,
and to x86 OpenBSD, if it's possible (doesn't OpenBSD use a.out-style
executables, not ELF?)
| 1806 | |
| 1807 | <p> |
| 1808 | The main difficulties, for an x86-ELF platform, seem to be: |
| 1809 | |
| 1810 | <ul> |
| 1811 | <li>You'd need to rewrite the <code>/proc/self/maps</code> parser |
| 1812 | (<code>vg_procselfmaps.c</code>). |
| 1813 | Easy. |
| 1814 | <p> |
| 1815 | <li>You'd need to rewrite <code>vg_syscall_mem.c</code>, or, more |
| 1816 | specifically, provide one for your OS. This is tedious, but you |
| 1817 | can implement syscalls on demand, and the Linux kernel interface |
| 1818 | is, for the most part, going to look very similar to the *BSD |
| 1819 | interfaces, so it's really a copy-paste-and-modify-on-demand job. |
| 1820 | As part of this, you'd need to supply a new |
| 1821 | <code>vg_kerneliface.h</code> file. |
| 1822 | <p> |
| 1823 | <li>You'd also need to change the syscall wrappers for Valgrind's |
| 1824 | internal use, in <code>vg_mylibc.c</code>. |
| 1825 | </ul> |
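<p>
Of these, the maps parser is probably the easiest to scope out. A line of <code>/proc/self/maps</code> has a fixed shape, so a minimal parser is a single <code>sscanf</code> per line. This sketch is illustrative only; the real <code>vg_procselfmaps.c</code> delivers its results differently:

```c
#include <stdio.h>
#include <string.h>

/* Parse one /proc/self/maps line of the form
   "40000000-40013000 r-xp 00000000 03:01 49975  /lib/ld-2.2.4.so".
   Returns 1 on success, filling in the segment extent and
   permissions; the filename, if present, is left in fname. */
int parse_maps_line(const char *line, unsigned long *start,
                    unsigned long *end, char perms[5], char fname[256])
{
   unsigned long offset;
   unsigned dev_maj, dev_min, inode;
   fname[0] = '\0';
   int n = sscanf(line, "%lx-%lx %4s %lx %x:%x %u %255s",
                  start, end, perms, &offset,
                  &dev_maj, &dev_min, &inode, fname);
   return n >= 7;   /* the filename field is optional */
}

/* Run the parser over a representative line. */
int maps_demo(void)
{
   unsigned long s, e;
   char p[5], f[256];
   const char *ln =
      "40000000-40013000 r-xp 00000000 03:01 49975      /lib/ld-2.2.4.so";
   if (!parse_maps_line(ln, &s, &e, p, f)) return 0;
   return s == 0x40000000ul && e == 0x40013000ul && p[3] == 'p'
          && strcmp(f, "/lib/ld-2.2.4.so") == 0;
}
```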
| 1826 | |
| 1827 | All in all, I think a port to x86-ELF *BSDs is not really very |
| 1828 | difficult, and in some ways I would like to see it happen, because |
that would force a clearer factoring of Valgrind into platform
| 1830 | dependent and independent pieces. Not to mention, *BSD folks also |
| 1831 | deserve to use Valgrind just as much as the Linux crew do. |
| 1832 | |
| 1833 | |
| 1834 | <p> |
| 1835 | <hr width="100%"> |

<h2>Easy stuff which ought to be done</h2>

<h3>MMX instructions</h3>

MMX insns should be supported, using the same trick as for FPU insns.
Provided the MMX registers are not used to copy uninitialised junk
from one place in memory to another, we don't have to actually
simulate the internal MMX unit state, so the FPU hack applies. This
should be fairly easy.


<h3>Fix stabs-info reader</h3>

The machinery in <code>vg_symtab2.c</code> which reads "stabs" style
debugging info is pretty weak. It usually correctly translates
simulated program counter values into line numbers and procedure
names, but the file name is often completely wrong; I suspect the
logic used to parse "stabs" entries is flawed, and it should be fixed.
The simplest solution, IMO, is to copy either the logic or simply the
code out of GNU binutils which does this; since GDB can clearly get it
right, binutils (or GDB?) must have code to do this somewhere.




<h3>BT/BTC/BTS/BTR</h3>

These are x86 instructions which test, complement, set, or reset a
single bit in a word. At the moment they are both incorrectly
implemented and incorrectly instrumented.

<p>
The incorrect instrumentation is due to the use of helper functions.
This means we lose bit-level definedness tracking, which could wind up
giving spurious uninitialised-value use errors. The Right Thing to do
is to invent a couple of new UOpcodes, I think <code>GET_BIT</code>
and <code>SET_BIT</code>, which can be used to implement all four x86
insns, get rid of the helpers, and give bit-accurate instrumentation
rules for the two new UOpcodes.

<p>
I realised the other day that they are mis-implemented too. The x86
insns take a bit-index and a register or memory location to access.
For registers the bit index clearly can only be in the range zero to
register-width minus 1, and I assumed the same applied to memory
locations too. But evidently not; for memory locations the index can
be arbitrary, and the processor will index arbitrarily far into
memory as a result. This too should be fixed. Sigh. Presumably
indexing outside the immediate word is not actually used by any
programs yet tested on Valgrind, for otherwise they would simply not
work at all. If you plan to hack on this, first check the Intel docs
to make sure my understanding is really correct.


<h3>Using PREFETCH instructions</h3>

Here's a small but potentially interesting project for performance
junkies. Experiments with Valgrind's code generator and optimiser(s)
suggest that reducing the number of instructions executed in the
translations and mem-check helpers gives disappointingly small
performance improvements. Perhaps this is because performance of
Valgrindified code is limited by cache misses. After all, each read
in the original program now gives rise to at least three reads: one
for the <code>VG_(primary_map)</code>, one for the resulting
secondary map, and the original. Not to mention, the instrumented
translations are 13 to 14 times larger than the originals. All in all
one would expect the memory system to be hammered to hell and then
some.

<p>
So here's an idea. An x86 insn involving a read from memory, after
instrumentation, will turn into ucode of the following form:
<pre>
   ... calculate effective addr, into ta and qa ...
   TESTVL qa             -- is the addr defined?
   LOADV (ta), qloaded   -- fetch V bits for the addr
   LOAD (ta), tloaded    -- do the original load
</pre>
At the point where the <code>LOADV</code> is done, we know the actual
address (<code>ta</code>) from which the real <code>LOAD</code> will
be done. We also know that the <code>LOADV</code> will take around
20 x86 insns to do. So it seems plausible that doing a prefetch of
<code>ta</code> just before the <code>LOADV</code> might just avoid a
miss at the <code>LOAD</code> point, and that might be a significant
performance win.
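For instance, the transformed sequence might look like this, where
<code>PREFETCH</code> is a hypothetical new UOpcode invented for this
sketch, not part of the existing ucode set:
<pre>
   ... calculate effective addr, into ta and qa ...
   PREFETCH (ta)         -- touch the cache line early
   TESTVL qa             -- is the addr defined?
   LOADV (ta), qloaded   -- fetch V bits for the addr
   LOAD (ta), tloaded    -- do the original load
</pre>
The ~20 insns of <code>LOADV</code> then run while the line is (with
luck) on its way into the cache.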

<p>
Prefetch insns are notoriously temperamental, more often than not
making things worse rather than better, so this would require
considerable fiddling around. It's complicated because Intel and AMD
have different prefetch insns with different semantics, so that too
needs to be taken into account. As a general rule, even placing the
prefetch immediately before the <code>LOADV</code> insn is too near
the <code>LOAD</code>; the ideal distance is apparently circa 200 CPU
cycles. So it might be worth having another analysis/transformation
pass which hoists prefetches as far back as possible, ideally to
immediately after the effective address becomes available.

<p>
Doing too many prefetches is also bad because they soak up bus
bandwidth / CPU resources, so some cleverness in deciding which loads
to prefetch and which not to might be helpful. One can imagine not
prefetching client-stack-relative (<code>%EBP</code> or
<code>%ESP</code>) accesses, since the stack in general tends to show
good locality anyway.

<p>
There's quite a lot of experimentation to do here, but I think it
might make an interesting week's work for someone.

<p>
As of 15-ish March 2002, I've started to experiment with this, using
the AMD <code>prefetch/prefetchw</code> insns.



<h3>User-defined permission ranges</h3>

This is quite a large project -- perhaps a month's hacking for a
capable hacker to do a good job -- but it's potentially very
interesting. The outcome would be that Valgrind could detect a whole
class of bugs which it currently cannot.

<p>
The proposal falls into two pieces.

<p>
<b>Part 1: user-defined address-range permission setting</b>
<p>

Valgrind intercepts the client's <code>malloc</code>,
<code>free</code>, etc calls, watches system calls, and watches the
stack pointer move. This is currently the only way it knows which
addresses are valid and which are not. Sometimes the client program
knows extra information about its memory areas. For example, the
client could at some point know that all elements of an array are
out-of-date. We would like to be able to convey to Valgrind the
information that the array is now addressable-but-uninitialised, so
that Valgrind can then warn if elements are used before they get new
values.

<p>
What I would like are some macros like this:
<pre>
   VALGRIND_MAKE_NOACCESS(addr, len)
   VALGRIND_MAKE_WRITABLE(addr, len)
   VALGRIND_MAKE_READABLE(addr, len)
</pre>
and also, to check that memory is addressable/initialised,
<pre>
   VALGRIND_CHECK_ADDRESSIBLE(addr, len)
   VALGRIND_CHECK_INITIALISED(addr, len)
</pre>

<p>
I then include in my sources a header defining these macros, rebuild
my app, run under Valgrind, and get user-defined checks.

<p>
Now here's a neat trick. It's a nuisance to have to re-link the app
with some new library which implements the above macros. So the idea
is to define the macros so that the resulting executable is still
completely stand-alone, and can be run without Valgrind, in which case
the macros do nothing, but when run on Valgrind, the Right Thing
happens. How to do this? The idea is for these macros to turn into a
piece of inline assembly code which (1) has no effect when run on the
real CPU, (2) is easily spotted by Valgrind's JITter, and (3) is
something no sane person would ever write, which is important for
avoiding false matches in (2). So here's a suggestion:
<pre>
   VALGRIND_MAKE_NOACCESS(addr, len)
</pre>
becomes (roughly speaking)
<pre>
   movl addr, %eax
   movl len,  %ebx
   movl $1,   %ecx   -- 1 describes the action; MAKE_WRITABLE might be
                     -- 2, etc
   rorl $13,  %ecx
   rorl $19,  %ecx
   rorl $11,  %eax
   rorl $21,  %eax
</pre>
The rotate sequences have no net effect, and it's unlikely they would
appear for any other reason, but they define a unique byte sequence
which the JITter can easily spot. Using the operand constraints
section at the end of a gcc inline-assembly statement, we can tell gcc
that the assembly fragment kills <code>%eax</code>, <code>%ebx</code>,
<code>%ecx</code> and the condition codes, so the fragment is harmless
and cheap when not running on Valgrind, and does not require any other
library support.

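Using gcc's extended-asm syntax, the macro might be defined roughly as
below. This is only a sketch of the scheme, assuming x86-32 and gcc;
the request code and the exact magic sequence are whatever the JITter
is taught to recognise, and the operand constraints are the clobber
machinery mentioned above:
<pre>
   #define VALGRIND_MAKE_NOACCESS(addr, len)                    \
      __asm__ __volatile__(                                     \
         "movl %0, %%eax\n\t"                                   \
         "movl %1, %%ebx\n\t"                                   \
         "movl $1, %%ecx\n\t"   /* request code 1 = NOACCESS */ \
         "rorl $13, %%ecx\n\t"                                  \
         "rorl $19, %%ecx\n\t"  /* 13+19 = 32: no net effect */ \
         "rorl $11, %%eax\n\t"                                  \
         "rorl $21, %%eax"      /* 11+21 = 32: no net effect */ \
         : /* no outputs */                                     \
         : "r" (addr), "r" (len)                                \
         : "eax", "ebx", "ecx", "cc" )
</pre>
The clobber list is what makes the fragment safe on a real CPU: gcc
knows <code>%eax</code>, <code>%ebx</code>, <code>%ecx</code> and the
condition codes are trashed and generates correct code around it
whether or not Valgrind is present.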


<p>
<b>Part 2: using it to detect interference between stack variables</b>
<p>

Currently Valgrind cannot detect errors of the following form:
<pre>
void fooble ( void )
{
   int a[10];
   int b[10];
   a[10] = 99;
}
</pre>
Now imagine rewriting this as
<pre>
void fooble ( void )
{
   int spacer0;
   int a[10];
   int spacer1;
   int b[10];
   int spacer2;
   VALGRIND_MAKE_NOACCESS(&spacer0, sizeof(int));
   VALGRIND_MAKE_NOACCESS(&spacer1, sizeof(int));
   VALGRIND_MAKE_NOACCESS(&spacer2, sizeof(int));
   a[10] = 99;
}
</pre>
Now the invalid write is certain to hit <code>spacer0</code> or
<code>spacer1</code>, so Valgrind will spot the error.

<p>
There are two complications.

<p>
The first is that we don't want to annotate sources by hand, so the
Right Thing to do is to write a C/C++ parser, annotator and
prettyprinter which does this automatically, and run it on post-CPP'd
C/C++ source. See <a
href="http://www.cacheprof.org">http://www.cacheprof.org</a> for an
example of a system which transparently inserts another phase into the
gcc/g++ compilation route. The parser/prettyprinter is probably not
as hard as it sounds; I would write it in Haskell, a powerful
functional language well suited to doing symbolic computation, with
which I am intimately familiar. There is already a C parser written
in Haskell by someone in the Haskell community, and that would
probably be a good starting point.

<p>
The second complication is how to get rid of these
<code>NOACCESS</code> records inside Valgrind when the instrumented
function exits; after all, they refer to stack addresses and will make
no sense whatever when some other function happens to re-use the same
stack address range, probably shortly afterwards. I think I would be
inclined to define a special stack-specific macro
<pre>
   VALGRIND_MAKE_NOACCESS_STACK(addr, len)
</pre>
which causes Valgrind to record the client's <code>%ESP</code> at the
time it is executed. Valgrind will then watch for changes in
<code>%ESP</code> and discard such records as soon as the protected
area is uncovered by an increase in <code>%ESP</code>. I hesitate
over this scheme only because it is potentially expensive if there are
hundreds of such records, and considering that changes in
<code>%ESP</code> already require expensive messing with stack access
permissions.

<p>
This is probably easier and more robust than having the instrumenter
program try to spot all exit points for the procedure and place
suitable deallocation annotations there. Besides, C++ procedures can
bomb out at any point if they get an exception, so spotting return
points at the source level just won't work at all.

<p>
Although some work, it's all eminently doable, and it would make
Valgrind into an even-more-useful tool.

<p>
Update: as of 17 March 2002, these hooks are done.


<p>
</body>
</html>