| <html> |
| <head> |
| <style type="text/css"> |
| body { background-color: #ffffff; |
| color: #000000; |
| font-family: Times, Helvetica, Arial; |
| font-size: 14pt} |
| h4 { margin-bottom: 0.3em} |
| code { color: #000000; |
| font-family: Courier; |
| font-size: 13pt } |
| pre { color: #000000; |
| font-family: Courier; |
| font-size: 13pt } |
| a:link { color: #0000C0; |
| text-decoration: none; } |
| a:visited { color: #0000C0; |
| text-decoration: none; } |
| a:active { color: #0000C0; |
| text-decoration: none; } |
| </style> |
| <title>The design and implementation of Valgrind</title> |
| </head> |
| |
| <body bgcolor="#ffffff"> |
| |
| <a name="title"> </a> |
| <h1 align=center>The design and implementation of Valgrind</h1> |
| |
| <center> |
| Detailed technical notes for hackers, maintainers and the |
| overly-curious<br> |
| These notes pertain to snapshot 20020306<br> |
| <p> |
| <a href="mailto:jseward@acm.org">jseward@acm.org<br> |
| <a href="http://developer.kde.org/~sewardj">http://developer.kde.org/~sewardj</a><br> |
| <a href="http://www.muraroa.demon.co.uk">http://www.muraroa.demon.co.uk</a><br> |
| Copyright © 2000-2002 Julian Seward |
| <p> |
| Valgrind is licensed under the GNU General Public License, |
| version 2<br> |
| An open-source tool for finding memory-management problems in |
| x86 GNU/Linux executables. |
| </center> |
| |
| <p> |
| |
| |
| |
| |
| <hr width="100%"> |
| |
| <h2>Introduction</h2> |
| |
| This document contains a detailed, highly-technical description of the |
| internals of Valgrind. This is not the user manual; if you are an |
| end-user of Valgrind, you do not want to read this. Conversely, if |
| you really are a hacker-type and want to know how it works, I assume |
| that you have read the user manual thoroughly. |
| <p> |
| You may need to read this document several times, and carefully. Some |
| important things, I only say once. |
| |
| |
| <h3>History</h3> |
| |
| Valgrind came into public view in late Feb 2002. However, it has been |
| under contemplation for a very long time, perhaps seriously for about |
| five years. Somewhat over two years ago, I started working on the x86 |
| code generator for the Glasgow Haskell Compiler |
| (http://www.haskell.org/ghc), gaining familiarity with x86 internals |
| on the way. I then did Cacheprof (http://www.cacheprof.org), gaining |
| further x86 experience. Some time around Feb 2000 I started |
| experimenting with a user-space x86 interpreter for x86-Linux. This |
| worked, but it was clear that a JIT-based scheme would be necessary to |
| give reasonable performance for Valgrind. Design work for the JITter |
| started in earnest in Oct 2000, and by early 2001 I had an x86-to-x86 |
| dynamic translator which could run quite large programs. This |
| translator was in a sense pointless, since it did not do any |
| instrumentation or checking. |
| |
| <p> |
| Most of the rest of 2001 was taken up designing and implementing the |
| instrumentation scheme. The main difficulty, which consumed a lot |
| of effort, was to design a scheme which did not generate large numbers |
| of false uninitialised-value warnings. By late 2001 a satisfactory |
| scheme had been arrived at, and I started to test it on ever-larger |
| programs, with an eventual eye to making it work well enough so that |
| it was helpful to folks debugging the upcoming version 3 of KDE. I've |
| used KDE since before version 1.0, and wanted Valgrind to be an |
| indirect contribution to the KDE 3 development effort. At the start of |
| Feb 02 the kde-core-devel crew started using it, and gave a huge |
| amount of helpful feedback and patches in the space of three weeks. |
| Snapshot 20020306 is the result. |
| |
| <p> |
| In the best Unix tradition, or perhaps in the spirit of Fred Brooks' |
| depressing-but-completely-accurate epitaph "build one to throw away; |
| you will anyway", much of Valgrind is a second or third rendition of |
| the initial idea. The instrumentation machinery |
| (<code>vg_translate.c</code>, <code>vg_memory.c</code>) and core CPU |
| simulation (<code>vg_to_ucode.c</code>, <code>vg_from_ucode.c</code>) |
| have had three redesigns and rewrites; the register allocator, |
| low-level memory manager (<code>vg_malloc2.c</code>) and symbol table |
| reader (<code>vg_symtab2.c</code>) are on the second rewrite. In a |
| sense, this document serves to record some of the knowledge gained as |
| a result. |
| |
| |
| <h3>Design overview</h3> |
| |
| Valgrind is compiled into a Linux shared object, |
| <code>valgrind.so</code>, and also a dummy one, |
| <code>valgrinq.so</code>, of which more later. The |
| <code>valgrind</code> shell script adds <code>valgrind.so</code> to |
| the <code>LD_PRELOAD</code> list of extra libraries to be |
| loaded with any dynamically linked executable.  This is a standard trick, |
| one which I assume the <code>LD_PRELOAD</code> mechanism was developed |
| to support. |
| |
| <p> |
| <code>valgrind.so</code> |
| is linked with the <code>-z initfirst</code> flag, which requests that |
| its initialisation code is run before that of any other object in the |
| executable image. When this happens, valgrind gains control. The |
| real CPU becomes "trapped" in <code>valgrind.so</code> and the |
| translations it generates. The synthetic CPU provided by Valgrind |
| does, however, return from this initialisation function. So the |
| normal startup actions, orchestrated by the dynamic linker |
| <code>ld.so</code>, continue as usual, except on the synthetic CPU, |
| not the real one. Eventually <code>main</code> is run and returns, |
| and then the finalisation code of the shared objects is run, |
| presumably in inverse order to which they were initialised. Remember, |
| this is still all happening on the simulated CPU. Eventually |
| <code>valgrind.so</code>'s own finalisation code is called. It spots |
| this event, shuts down the simulated CPU, prints any error summaries |
| and/or does leak detection, and returns from the initialisation code |
| on the real CPU. At this point, in effect the real and synthetic CPUs |
| have merged back into one, Valgrind has lost control of the program, |
| and the program finally <code>exit()s</code> back to the kernel in the |
| usual way. |
| |
| <p> |
| The normal course of activity, once Valgrind has started up, is as |
| follows. Valgrind never runs any part of your program (usually |
| referred to as the "client"), not a single byte of it, directly. |
| Instead it uses function <code>VG_(translate)</code> to translate |
| basic blocks (BBs, straight-line sequences of code) into instrumented |
| translations, and those are run instead. The translations are stored |
| in the translation cache (TC), <code>vg_tc</code>, with the |
| translation table (TT), <code>vg_tt</code> supplying the |
| original-to-translation code address mapping. Auxiliary array |
| <code>VG_(tt_fast)</code> is used as a direct-map cache for fast |
| lookups in TT; it usually achieves a hit rate of around 98% and |
| facilitates an orig-to-trans lookup in 4 x86 insns, which is not bad. |
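| |
| <p> |
| To make the fast-lookup idea concrete, here is a minimal C sketch of |
| a direct-mapped translation cache.  The table size, index function |
| and entry layout are illustrative guesses, not the real |
| <code>VG_(tt_fast)</code>; the slow path stands in for |
| <code>VG_(search_transtab)</code>. |
| <pre> |
| #define FAST_CACHE_SIZE 4096          /* assumed: some power of two */ |
| |
| typedef struct { unsigned int orig; unsigned int trans; } FastEntry; |
| static FastEntry fast_cache[FAST_CACHE_SIZE]; |
| |
| /* Slow path: search the full translation table (declared, not defined, here). */ |
| extern unsigned int search_full_tt ( unsigned int orig ); |
| |
| /* A few address bits index the cache directly, so a hit costs only |
|    a handful of x86 instructions. */ |
| static unsigned int find_translation ( unsigned int orig ) |
| { |
|    unsigned int i = (orig >> 2) & (FAST_CACHE_SIZE - 1); |
|    if (fast_cache[i].orig == orig) |
|       return fast_cache[i].trans;     /* hit: ~98% of lookups    */ |
|    return search_full_tt(orig);       /* miss: consult TT proper */ |
| } |
| </pre> |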
| |
| <p> |
| Function <code>VG_(dispatch)</code> in <code>vg_dispatch.S</code> is |
| the heart of the JIT dispatcher. Once a translated code address has |
| been found, it is executed simply by an x86 <code>call</code> |
| to the translation. At the end of the translation, the next |
| original code addr is loaded into <code>%eax</code>, and the |
| translation then does a <code>ret</code>, taking it back to the |
| dispatch loop, with, interestingly, zero branch mispredictions. |
| The address requested in <code>%eax</code> is looked up first in |
| <code>VG_(tt_fast)</code>, and, if not found, by calling C helper |
| <code>VG_(search_transtab)</code>. If there is still no translation |
| available, <code>VG_(dispatch)</code> exits back to the top-level |
| C dispatcher <code>VG_(toploop)</code>, which arranges for |
| <code>VG_(translate)</code> to make a new translation. All fairly |
| unsurprising, really. There are various complexities described below. |
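| |
| <p> |
| The following is a rough, compilable C outline of that cycle; the |
| real <code>VG_(toploop)</code> and <code>VG_(dispatch)</code> differ |
| in detail (and the latter is assembly), so treat the names and |
| structure as illustrative only. |
| <pre> |
| typedef unsigned int Addr; |
| |
| extern Addr find_translation ( Addr orig );   /* tt_fast, then TT         */ |
| extern void make_translation ( Addr orig );   /* cf. VG_(translate)       */ |
| extern Addr run_translation  ( Addr trans );  /* call into the translation; |
|                                                  it rets with the next |
|                                                  original addr in %eax    */ |
| |
| void dispatch_forever ( Addr next_orig ) |
| { |
|    while (1) { |
|       Addr trans = find_translation(next_orig); |
|       if (trans == 0) { |
|          make_translation(next_orig);         /* then retry the lookup    */ |
|          continue; |
|       } |
|       next_orig = run_translation(trans); |
|    } |
| } |
| </pre> |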
| |
| <p> |
| The translator, orchestrated by <code>VG_(translate)</code>, is |
| complicated but entirely self-contained. It is described in great |
| detail in subsequent sections. Translations are stored in TC, with TT |
| tracking administrative information. The translations are subject to |
| an approximate LRU-based management scheme. With the current |
| settings, the TC can hold at most about 15MB of translations, and LRU |
| passes prune it to about 13.5MB. Given that the |
| orig-to-translation expansion ratio is about 13:1 to 14:1, this means |
| TC holds translations for more or less a megabyte of original code, |
| which generally comes to about 70000 basic blocks for C++ compiled |
| with optimisation on. Generating new translations is expensive, so it |
| is worth having a large TC to minimise the (capacity) miss rate. |
| |
| <p> |
| The dispatcher, <code>VG_(dispatch)</code>, receives hints from |
| the translations which allow it to cheaply spot all control |
| transfers corresponding to x86 <code>call</code> and <code>ret</code> |
| instructions. It has to do this in order to spot some special events: |
| <ul> |
| <li>Calls to <code>VG_(shutdown)</code>. This is Valgrind's cue to |
| exit. NOTE: actually this is done a different way; it should be |
| cleaned up. |
| <p> |
| <li>Returns of system call handlers, to the return address |
| <code>VG_(signalreturn_bogusRA)</code>. The signal simulator |
| needs to know when a signal handler is returning, so we spot |
| jumps (returns) to this address. |
| <p> |
| <li>Calls to <code>vg_trap_here</code>. All <code>malloc</code>, |
| <code>free</code>, etc calls that the client program makes are |
| eventually routed to a call to <code>vg_trap_here</code>, |
| and Valgrind does its own special thing with these calls. |
| In effect this provides a trapdoor, by which Valgrind can |
| intercept certain calls on the simulated CPU, run the call as it |
| sees fit itself (on the real CPU), and return the result to |
| the simulated CPU, quite transparently to the client program. |
| </ul> |
| Valgrind intercepts the client's <code>malloc</code>, |
| <code>free</code>, etc, |
| calls, so that it can store additional information. Each block |
| <code>malloc</code>'d by the client gives rise to a shadow block |
| in which Valgrind stores the call stack at the time of the |
| <code>malloc</code> |
| call. When the client calls <code>free</code>, Valgrind tries to |
| find the shadow block corresponding to the address passed to |
| <code>free</code>, and emits an error message if none can be found. |
| If it is found, the block is placed on the freed blocks queue |
| <code>vg_freed_list</code>, it is marked as inaccessible, and |
| its shadow block now records the call stack at the time of the |
| <code>free</code> call. Keeping <code>free</code>'d blocks in |
| this queue allows Valgrind to spot all (presumably invalid) accesses |
| to them. However, once the volume of blocks in the free queue |
| exceeds <code>VG_(clo_freelist_vol)</code>, blocks are finally |
| removed from the queue. |
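| |
| <p> |
| A minimal sketch of this bookkeeping follows; the structure layout, |
| names and stack depth are assumptions for illustration, not the real |
| <code>vg_clientmalloc.c</code> types. |
| <pre> |
| #define STACK_DEPTH 4 |
| |
| typedef struct _ShadowBlock { |
|    struct _ShadowBlock* next;              /* freed-queue link            */ |
|    unsigned int addr;                      /* client address of the block */ |
|    unsigned int size; |
|    unsigned int alloc_stack[STACK_DEPTH];  /* call stack at malloc time   */ |
|    unsigned int free_stack [STACK_DEPTH];  /* call stack at free time     */ |
| } ShadowBlock; |
| |
| /* FIFO of freed blocks, so (invalid) accesses to recently freed |
|    memory can still be attributed to the right block. */ |
| static ShadowBlock* freed_head = 0; |
| static ShadowBlock* freed_tail = 0; |
| static unsigned int freed_vol = 0; |
| static unsigned int freed_vol_limit = 1000000;  /* cf. VG_(clo_freelist_vol) */ |
| |
| static void enqueue_freed ( ShadowBlock* sb ) |
| { |
|    sb->next = 0; |
|    if (freed_tail) freed_tail->next = sb; else freed_head = sb; |
|    freed_tail = sb; |
|    freed_vol += sb->size; |
|    /* Once the queue is over the limit, really retire the oldest |
|       blocks; only then can their storage be reused. */ |
|    while (freed_vol > freed_vol_limit) { |
|       ShadowBlock* old = freed_head; |
|       freed_head = old->next; |
|       if (freed_head == 0) freed_tail = 0; |
|       freed_vol -= old->size; |
|       /* ... hand the underlying memory back to the allocator ... */ |
|    } |
| } |
| </pre> |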
| |
| <p> |
| Keeping track of A and V bits (note: if you don't know what these are, |
| you haven't read the user guide carefully enough) for memory is done |
| in <code>vg_memory.c</code>. This implements a sparse array structure |
| which covers the entire 4G address space in a way which is reasonably |
| fast and reasonably space efficient. The 4G address space is divided |
| up into 64K sections, each covering 64KB of address space.  Given a |
| 32-bit address, the top 16 bits are used to select one of the 65536 |
| entries in <code>VG_(primary_map)</code>. The resulting "secondary" |
| (<code>SecMap</code>) holds A and V bits for the 64k of address space |
| chunk corresponding to the lower 16 bits of the address. |
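| |
| <p> |
| A sketch of the lookup, with assumed field names (the real structures |
| live in <code>vg_memory.c</code>; the A-bit/V-bit packing shown |
| follows the description in the user manual): |
| <pre> |
| typedef struct { |
|    unsigned char abits[65536 / 8];   /* one A bit per byte of address space */ |
|    unsigned char vbyte[65536];       /* eight V bits per byte               */ |
| } SecMap; |
| |
| static SecMap* primary_map[65536];   /* cf. VG_(primary_map) */ |
| |
| /* Fetch the V bits for the byte at address a (illustrative only; |
|    assumes the secondary map is already present). */ |
| static unsigned char get_vbyte ( unsigned int a ) |
| { |
|    SecMap* sm = primary_map[a >> 16];   /* top 16 bits pick the secondary */ |
|    return sm->vbyte[a & 0xFFFF];        /* low 16 bits index within it    */ |
| } |
| </pre> |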
| |
| |
| <h3>Design decisions</h3> |
| |
| Some design decisions were motivated by the need to make Valgrind |
| debuggable. Imagine you are writing a CPU simulator. It works fairly |
| well. However, you run some large program, like Netscape, and after |
| tens of millions of instructions, it crashes. How can you figure out |
| where in your simulator the bug is? |
| |
| <p> |
| Valgrind's answer is: cheat. Valgrind is designed so that it is |
| possible to switch back to running the client program on the real |
| CPU at any point.  Using the <code>--stop-after=</code> flag, you can |
| ask Valgrind to run just some number of basic blocks, and then |
| run the rest of the way on the real CPU. If you are searching for |
| a bug in the simulated CPU, you can use this to do a binary search, |
| which quickly leads you to the specific basic block which is |
| causing the problem. |
| |
| <p> |
| This is all very handy. It does constrain the design in certain |
| unimportant ways. Firstly, the layout of memory, when viewed from the |
| client's point of view, must be identical regardless of whether it is |
| running on the real or simulated CPU. This means that Valgrind can't |
| do pointer swizzling -- well, no great loss -- and it can't run on |
| the same stack as the client -- again, no great loss. |
| Valgrind operates on its own stack, <code>VG_(stack)</code>, which |
| it switches to at startup, temporarily switching back to the client's |
| stack when doing system calls for the client. |
| |
| <p> |
| Valgrind also receives signals on its own stack, |
| <code>VG_(sigstack)</code>, but for different gruesome reasons |
| discussed below. |
| |
| <p> |
| This nice clean switch-back-to-the-real-CPU-whenever-you-like story |
| is muddied by signals. Problem is that signals arrive at arbitrary |
| times and tend to slightly perturb the basic block count, with the |
| result that you can get close to the basic block causing a problem but |
| can't home in on it exactly. My kludgey hack is to define |
| <code>SIGNAL_SIMULATION</code> to 1 towards the bottom of |
| <code>vg_syscall_mem.c</code>, so that signal handlers are run on the |
| real CPU and don't change the BB counts. |
| |
| <p> |
| A second hole in the switch-back-to-real-CPU story is that Valgrind's |
| way of delivering signals to the client is different from that of the |
| kernel. Specifically, the layout of the signal delivery frame, and |
| the mechanism used to detect a sighandler returning, are different. |
| So you can't expect to make the transition inside a sighandler and |
| still have things working, but in practice that's not much of a |
| restriction. |
| |
| <p> |
| Valgrind's implementation of <code>malloc</code>, <code>free</code>, |
| etc, (in <code>vg_clientmalloc.c</code>, not the low-level stuff in |
| <code>vg_malloc2.c</code>) is somewhat complicated by the need to |
| handle switching back at arbitrary points.  It does work, though. |
| |
| |
| |
| <h3>Correctness</h3> |
| |
| There's only one of me, and I have a Real Life (tm) as well as hacking |
| Valgrind [allegedly :-]. That means I don't have time to waste |
| chasing endless bugs in Valgrind. My emphasis is therefore on doing |
| everything as simply as possible, with correctness, stability and |
| robustness being the number one priority, more important than |
| performance or functionality. As a result: |
| <ul> |
| <li>The code is absolutely loaded with assertions, and these are |
| <b>permanently enabled.</b> I have no plan to remove or disable |
| them later. Over the past couple of months, as valgrind has |
| become more widely used, they have shown their worth, pulling |
| up various bugs which would otherwise have appeared as |
| hard-to-find segmentation faults. |
| <p> |
| I am of the view that it's acceptable to spend 5% of the total |
| running time of your valgrindified program doing assertion checks |
| and other internal sanity checks. |
| <p> |
| <li>Aside from the assertions, valgrind contains various sets of |
| internal sanity checks, which get run at varying frequencies |
| during normal operation. <code>VG_(do_sanity_checks)</code> |
| runs every 1000 basic blocks, which means 500 to 2000 times/second |
| for typical machines at present. It checks that Valgrind hasn't |
| overrun its private stack, and does some simple checks on the |
| memory permissions maps. Once every 25 calls it does some more |
| extensive checks on those maps. Etc, etc. |
| <p> |
| The following components also have sanity check code, which can |
| be enabled to aid debugging: |
| <ul> |
| <li>The low-level memory-manager |
| (<code>VG_(mallocSanityCheckArena)</code>). This does a |
| complete check of all blocks and chains in an arena, which |
| is very slow. Is not engaged by default. |
| <p> |
| <li>The symbol table reader(s): various checks to ensure |
| uniqueness of mappings; see <code>VG_(read_symbols)</code> |
| for a start. Is permanently engaged. |
| <p> |
| <li>The A and V bit tracking stuff in <code>vg_memory.c</code>. |
| This can be compiled with cpp symbol |
| <code>VG_DEBUG_MEMORY</code> defined, which removes all the |
| fast, optimised cases, and uses simple-but-slow fallbacks |
| instead. Not engaged by default. |
| <p> |
| <li>Ditto <code>VG_DEBUG_LEAKCHECK</code>. |
| <p> |
| <li>The JITter parses x86 basic blocks into sequences of |
| UCode instructions. It then sanity checks each one with |
| <code>VG_(saneUInstr)</code> and sanity checks the sequence |
| as a whole with <code>VG_(saneUCodeBlock)</code>. This stuff |
| is engaged by default, and has caught some way-obscure bugs |
| in the simulated CPU machinery in its time. |
| <p> |
| <li>The system call wrapper does |
| <code>VG_(first_and_last_secondaries_look_plausible)</code> after |
| every syscall; this is known to pick up bugs in the syscall |
| wrappers. Engaged by default. |
| <p> |
| <li>The main dispatch loop, in <code>VG_(dispatch)</code>, checks |
| that translations do not set <code>%ebp</code> to any value |
| different from <code>VG_EBP_DISPATCH_CHECKED</code> or |
| <code>& VG_(baseBlock)</code>. In effect this test is free, |
| and is permanently engaged. |
| <p> |
| <li>There are a couple of ifdefed-out consistency checks I |
| inserted whilst debugging the new register allocator, |
| <code>vg_do_register_allocation</code>. |
| </ul> |
| <p> |
| <li>I try to avoid techniques, algorithms, mechanisms, etc, for which |
| I can supply neither a convincing argument that they are correct, |
| nor sanity-check code which might pick up bugs in my |
| implementation. I don't always succeed in this, but I try. |
| Basically the idea is: avoid techniques which are, in practice, |
| unverifiable, in some sense. When doing anything, always have in |
| mind: "how can I verify that this is correct?" |
| </ul> |
| |
| <p> |
| Some more specific things are: |
| |
| <ul> |
| <li>Valgrind runs in the same namespace as the client, at least from |
| <code>ld.so</code>'s point of view, and it therefore absolutely |
| had better not export any symbol with a name which could clash |
| with that of the client or any of its libraries. Therefore, all |
| globally visible symbols exported from <code>valgrind.so</code> |
| are defined using the <code>VG_</code> CPP macro. As you'll see |
| from <code>vg_constants.h</code>, this prepends an arbitrary |
| prefix to the symbol, in order that it be, we hope, globally |
| unique. Currently the prefix is <code>vgPlain_</code>. For |
| convenience there are also <code>VGM_</code>, <code>VGP_</code> |
| and <code>VGOFF_</code>. All locally defined symbols are declared |
| <code>static</code> and do not appear in the final shared object. |
| <p> |
| To check this, I periodically do |
| <code>nm valgrind.so | grep " T "</code>, |
| which shows you all the globally exported text symbols. |
| They should all have an approved prefix, except for those like |
| <code>malloc</code>, <code>free</code>, etc, which we deliberately |
| want to shadow and take precedence over the same names exported |
| from <code>glibc.so</code>, so that valgrind can intercept those |
| calls easily. Similarly, <code>nm valgrind.so | grep " D "</code> |
| allows you to find any rogue data-segment symbol names. |
| <p> |
| <li>Valgrind tries, and almost succeeds, in being completely |
| independent of all other shared objects, in particular of |
| <code>glibc.so</code>. For example, we have our own low-level |
| memory manager in <code>vg_malloc2.c</code>, which is a fairly |
| standard malloc/free scheme augmented with arenas, and |
| <code>vg_mylibc.c</code> exports reimplementations of various bits |
| and pieces you'd normally get from the C library. |
| <p> |
| Why all the hassle? Because imagine the potential chaos of both |
| the simulated and real CPUs executing in <code>glibc.so</code>. |
| It just seems simpler and cleaner to be completely self-contained, |
| so that only the simulated CPU visits <code>glibc.so</code>. In |
| practice it's not much hassle anyway. Also, valgrind starts up |
| before glibc has a chance to initialise itself, and who knows what |
| difficulties that could lead to. Finally, glibc has definitions |
| for some types, specifically <code>sigset_t</code>, which conflict |
| (are different from) the Linux kernel's idea of same. When |
| Valgrind wants to fiddle around with signal stuff, it wants to |
| use the kernel's definitions, not glibc's definitions. So it's |
| simplest just to keep glibc out of the picture entirely. |
| <p> |
| To find out which glibc symbols are used by Valgrind, reinstate |
| the link flags <code>-nostdlib -Wl,-no-undefined</code>. This |
| causes linking to fail, but will tell you what you depend on. |
| I have mostly, but not entirely, got rid of the glibc |
| dependencies; what remains is, IMO, fairly harmless. AFAIK the |
| current dependencies are: <code>memset</code>, |
| <code>memcmp</code>, <code>stat</code>, <code>system</code>, |
| <code>sbrk</code>, <code>setjmp</code> and <code>longjmp</code>. |
| |
| <p> |
| <li>Similarly, valgrind should not really import any headers other |
| than the Linux kernel headers, since it knows of no API other than |
| the kernel interface to talk to. At the moment this is really not |
| in a good state, and <code>vg_syscall_mem</code> imports, via |
| <code>vg_unsafe.h</code>, a significant number of C-library |
| headers so as to know the sizes of various structs passed across |
| the kernel boundary. This is of course completely bogus, since |
| there is no guarantee that the C library's definitions of these |
| structs matches those of the kernel. I have started to sort this |
| out using <code>vg_kerneliface.h</code>, into which I had intended |
| to copy all kernel definitions which valgrind could need, but this |
| has not gotten very far. At the moment it mostly contains |
| definitions for <code>sigset_t</code> and <code>struct |
| sigaction</code>, since the kernel's definition for these really |
| does clash with glibc's. I plan to use a <code>vki_</code> prefix |
| on all these types and constants, to denote the fact that they |
| pertain to <b>V</b>algrind's <b>K</b>ernel <b>I</b>nterface (a |
| sketch of the idea appears just after this list). |
| <p> |
| Another advantage of having a <code>vg_kerneliface.h</code> file |
| is that it makes it simpler to interface to a different kernel. |
| One can, for example, easily imagine writing a new |
| <code>vg_kerneliface.h</code> for FreeBSD, or x86 NetBSD. |
| |
| </ul> |
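| |
| <p> |
| To illustrate the <code>vki_</code> idea mentioned above (this is a |
| hypothetical sketch, not the actual contents of |
| <code>vg_kerneliface.h</code>): the kernel allows 64 signals on x86, |
| so its signal-set type is two 32-bit words, quite unlike glibc's much |
| larger <code>sigset_t</code>. |
| <pre> |
| /* Sketch only: mirror the Linux 2.4 x86 kernel's layout, not glibc's. */ |
| #define VKI_KNSIG        64 |
| #define VKI_KNSIG_WORDS  (VKI_KNSIG / 32) |
| |
| typedef struct { |
|    unsigned int ws[VKI_KNSIG_WORDS]; |
| } vki_ksigset_t; |
| </pre> |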
| |
| <h3>Current limitations</h3> |
| |
| No threads. I think fixing this is close to a research-grade problem. |
| <p> |
| No MMX. Fixing this should be relatively easy, using the same giant |
| trick used for x86 FPU instructions. See below. |
| <p> |
| Support for weird (non-POSIX) signal stuff is patchy. Does anybody |
| care? |
| <p> |
| |
| |
| |
| |
| <hr width="100%"> |
| |
| <h2>The instrumenting JITter</h2> |
| |
| This really is the heart of the matter. We begin with various side |
| issues. |
| |
| <h3>Run-time storage, and the use of host registers</h3> |
| |
| Valgrind translates client (original) basic blocks into instrumented |
| basic blocks, which live in the translation cache TC, until either the |
| client finishes or the translations are ejected from TC to make room |
| for newer ones. |
| <p> |
| Since it generates x86 code in memory, Valgrind has complete control |
| of the use of registers in the translations. Now pay attention. I |
| shall say this only once, and it is important you understand this. In |
| what follows I will refer to registers in the host (real) cpu using |
| their standard names, <code>%eax</code>, <code>%edi</code>, etc. I |
| refer to registers in the simulated CPU by capitalising them: |
| <code>%EAX</code>, <code>%EDI</code>, etc. These two sets of |
| registers usually bear no direct relationship to each other; there is |
| no fixed mapping between them. This naming scheme is used fairly |
| consistently in the comments in the sources. |
| <p> |
| Host registers, once things are up and running, are used as follows: |
| <ul> |
| <li><code>%esp</code>, the real stack pointer, points |
| somewhere in Valgrind's private stack area, |
| <code>VG_(stack)</code> or, transiently, into its signal delivery |
| stack, <code>VG_(sigstack)</code>. |
| <p> |
| <li><code>%edi</code> is used as a temporary in code generation; it |
| is almost always dead, except when used for the <code>Left</code> |
| value-tag operations. |
| <p> |
| <li><code>%eax</code>, <code>%ebx</code>, <code>%ecx</code>, |
| <code>%edx</code> and <code>%esi</code> are available to |
| Valgrind's register allocator. They are dead (carry unimportant |
| values) in between translations, and are live only in |
| translations. The one exception to this is <code>%eax</code>, |
| which, as mentioned far above, has a special significance to the |
| dispatch loop <code>VG_(dispatch)</code>: when a translation |
| returns to the dispatch loop, <code>%eax</code> is expected to |
| contain the original-code-address of the next translation to run. |
| The register allocator is so good at minimising spill code that |
| using five regs and not having to save/restore <code>%edi</code> |
| actually gives better code than allocating to <code>%edi</code> |
| as well, but then having to push/pop it around special uses. |
| <p> |
| <li><code>%ebp</code> points permanently at |
| <code>VG_(baseBlock)</code>. Valgrind's translations are |
| position-independent, partly because this is convenient, but also |
| because translations get moved around in TC as part of the LRUing |
| activity. <b>All</b> static entities which need to be referred to |
| from generated code, whether data or helper functions, are stored |
| starting at <code>VG_(baseBlock)</code> and are therefore reached |
| by indexing from <code>%ebp</code>. There is but one exception, |
| which is that by placing the value |
| <code>VG_EBP_DISPATCH_CHECKED</code> |
| in <code>%ebp</code> just before a return to the dispatcher, |
| the dispatcher is informed that the next address to run, |
| in <code>%eax</code>, requires special treatment. |
| <p> |
| <li>The real machine's FPU state is pretty much unimportant, for |
| reasons which will become obvious. Ditto its <code>%eflags</code> |
| register. |
| </ul> |
| |
| <p> |
| The state of the simulated CPU is stored in memory, in |
| <code>VG_(baseBlock)</code>, which is a block of 200 words IIRC. |
| Recall that <code>%ebp</code> points permanently at the start of this |
| block. Function <code>vg_init_baseBlock</code> decides what the |
| offsets of various entities in <code>VG_(baseBlock)</code> are to be, |
| and allocates word offsets for them. The code generator then emits |
| <code>%ebp</code> relative addresses to get at those things. The |
| sequence in which entities are allocated has been carefully chosen so |
| that the 32 most popular entities come first, because this means 8-bit |
| offsets can be used in the generated code. |
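| |
| <p> |
| The allocation scheme itself is trivial; a sketch (names hypothetical) |
| of the idea: |
| <pre> |
| /* Each entity gets a word offset into baseBlock and is reached in |
|    generated code as  4*offset(%ebp).  Word offsets 0 .. 31 give byte |
|    displacements 0 .. 124, which fit in a signed 8-bit displacement |
|    and hence in shorter instructions. */ |
| static unsigned int baseBlock[200]; |
| static int n_words_alloced = 0; |
| |
| static int alloc_baseBlock_words ( int n_words )   /* cf. vg_init_baseBlock */ |
| { |
|    int off = n_words_alloced; |
|    n_words_alloced += n_words; |
|    return off; |
| } |
| |
| /* e.g.  off_EAX = alloc_baseBlock_words(1); |
|    the simulated %EAX is then reached as  4*off_EAX(%ebp). */ |
| </pre> |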
| |
| <p> |
| If I was clever, I could make <code>%ebp</code> point 32 words along |
| <code>VG_(baseBlock)</code>, so that I'd have another 32 words of |
| short-form offsets available, but that's just complicated, and it's |
| not important -- the first 32 words take 99% (or whatever) of the |
| traffic. |
| |
| <p> |
| Currently, the sequence of stuff in <code>VG_(baseBlock)</code> is as |
| follows: |
| <ul> |
| <li>9 words, holding the simulated integer registers, |
| <code>%EAX</code> .. <code>%EDI</code>, and the simulated flags, |
| <code>%EFLAGS</code>. |
| <p> |
| <li>Another 9 words, holding the V bit "shadows" for the above 9 regs. |
| <p> |
| <li>The <b>addresses</b> of various helper routines called from |
| generated code: |
| <code>VG_(helper_value_check4_fail)</code>, |
| <code>VG_(helper_value_check0_fail)</code>, |
| which register V-check failures, |
| <code>VG_(helperc_STOREV4)</code>, |
| <code>VG_(helperc_STOREV1)</code>, |
| <code>VG_(helperc_LOADV4)</code>, |
| <code>VG_(helperc_LOADV1)</code>, |
| which do stores and loads of V bits to/from the |
| sparse array which keeps track of V bits in memory, |
| and |
| <code>VGM_(handle_esp_assignment)</code>, which messes with |
| memory addressability resulting from changes in <code>%ESP</code>. |
| <p> |
| <li>The simulated <code>%EIP</code>. |
| <p> |
| <li>24 spill words, for when the register allocator can't make it work |
| with 5 measly registers. |
| <p> |
| <li>Addresses of helpers <code>VG_(helperc_STOREV2)</code>, |
| <code>VG_(helperc_LOADV2)</code>. These are here because 2-byte |
| loads and stores are relatively rare, so are placed above the |
| magic 32-word offset boundary. |
| <p> |
| <li>For similar reasons, addresses of helper functions |
| <code>VGM_(fpu_write_check)</code> and |
| <code>VGM_(fpu_read_check)</code>, which handle the A/V maps |
| testing and changes required by FPU writes/reads. |
| <p> |
| <li>Some other boring helper addresses: |
| <code>VG_(helper_value_check2_fail)</code> and |
| <code>VG_(helper_value_check1_fail)</code>. These are probably |
| never emitted now, and should be removed. |
| <p> |
| <li>The entire state of the simulated FPU, which I believe to be |
| 108 bytes long. |
| <p> |
| <li>Finally, the addresses of various other helper functions in |
| <code>vg_helpers.S</code>, which deal with rare situations which |
| are tedious or difficult to generate code in-line for. |
| </ul> |
| |
| <p> |
| As a general rule, the simulated machine's state lives permanently in |
| memory at <code>VG_(baseBlock)</code>. However, the JITter does some |
| optimisations which allow the simulated integer registers to be |
| cached in real registers over multiple simulated instructions within |
| the same basic block. These are always flushed back into memory at |
| the end of every basic block, so that the in-memory state is |
| up-to-date between basic blocks. (This flushing is implied by the |
| statement above that the real machine's allocatable registers are |
| dead in between simulated blocks). |
| |
| |
| <h3>Startup, shutdown, and system calls</h3> |
| |
| Getting into Valgrind (<code>VG_(startup)</code>, called from |
| <code>valgrind.so</code>'s initialisation section), really means |
| copying the real CPU's state into <code>VG_(baseBlock)</code>, and |
| then installing our own stack pointer, etc, into the real CPU, and |
| then starting up the JITter. Exiting valgrind involves copying the |
| simulated state back to the real state. |
| |
| <p> |
| Unfortunately, there's a complication at startup time. Problem is |
| that at the point where we need to take a snapshot of the real CPU's |
| state, the offsets in <code>VG_(baseBlock)</code> are not set up yet, |
| because to do so would involve disrupting the real machine's state |
| significantly. The way round this is to dump the real machine's state |
| into a temporary, static block of memory, |
| <code>VG_(m_state_static)</code>. We can then set up the |
| <code>VG_(baseBlock)</code> offsets at our leisure, and copy into it |
| from <code>VG_(m_state_static)</code> at some convenient later time. |
| This copying is done by |
| <code>VG_(copy_m_state_static_to_baseBlock)</code>. |
| |
| <p> |
| On exit, the inverse transformation is (rather unnecessarily) used: |
| stuff in <code>VG_(baseBlock)</code> is copied to |
| <code>VG_(m_state_static)</code>, and the assembly stub then copies |
| from <code>VG_(m_state_static)</code> into the real machine registers. |
| |
| <p> |
| Doing system calls on behalf of the client (<code>vg_syscall.S</code>) |
| is something of a half-way house. We have to make the world look |
| sufficiently like that which the client would normally have to make |
| the syscall actually work properly, but we can't afford to lose |
| control. So the trick is to copy all of the client's state, <b>except |
| its program counter</b>, into the real CPU, do the system call, and |
| copy the state back out. Note that the client's state includes its |
| stack pointer register, so one effect of this partial restoration is |
| to cause the system call to be run on the client's stack, as it should |
| be. |
| |
| <p> |
| As ever there are complications. We have to save some of our own state |
| somewhere when restoring the client's state into the CPU, so that we |
| can keep going sensibly afterwards. In fact the only thing which is |
| important is our own stack pointer, but for paranoia reasons I save |
| and restore our own FPU state as well, even though that's probably |
| pointless. |
| |
| <p> |
| The complication on the above complication is, that for horrible |
| reasons to do with signals, we may have to handle a second client |
| system call whilst the client is blocked inside some other system |
| call (unbelievable!). That means there's two sets of places to |
| dump Valgrind's stack pointer and FPU state across the syscall, |
| and we decide which to use by consulting |
| <code>VG_(syscall_depth)</code>, which is in turn maintained by |
| <code>VG_(wrap_syscall)</code>. |
| |
| |
| |
| <h3>Introduction to UCode</h3> |
| |
| UCode lies at the heart of the x86-to-x86 JITter. The basic premise |
| is that dealing with the x86 instruction set head-on is just too darn |
| complicated, so we do the traditional compiler-writer's trick and |
| translate it into a simpler, easier-to-deal-with form. |
| |
| <p> |
| In normal operation, translation proceeds through six stages, |
| coordinated by <code>VG_(translate)</code>: |
| <ol> |
| <li>Parsing of an x86 basic block into a sequence of UCode |
| instructions (<code>VG_(disBB)</code>). |
| <p> |
| <li>UCode optimisation (<code>vg_improve</code>), with the aim of |
| caching simulated registers in real registers over multiple |
| simulated instructions, and removing redundant simulated |
| <code>%EFLAGS</code> saving/restoring. |
| <p> |
| <li>UCode instrumentation (<code>vg_instrument</code>), which adds |
| value and address checking code. |
| <p> |
| <li>Post-instrumentation cleanup (<code>vg_cleanup</code>), removing |
| redundant value-check computations. |
| <p> |
| <li>Register allocation (<code>vg_do_register_allocation</code>), |
| which, note, is done on UCode. |
| <p> |
| <li>Emission of final instrumented x86 code |
| (<code>VG_(emit_code)</code>). |
| </ol> |
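| |
| <p> |
| A rough, compilable outline of the pipeline follows.  The stage |
| functions are those named above, but the plain C names, signatures |
| and flag parameters shown here are stand-ins, not the real |
| prototypes. |
| <pre> |
| typedef struct UCodeBlock_ UCodeBlock; |
| |
| extern UCodeBlock* disBB               ( unsigned int orig_addr );  /* VG_(disBB)    */ |
| extern void        improve             ( UCodeBlock* cb );          /* vg_improve    */ |
| extern UCodeBlock* instrument          ( UCodeBlock* cb );          /* vg_instrument */ |
| extern void        cleanup             ( UCodeBlock* cb );          /* vg_cleanup    */ |
| extern void        register_allocation ( UCodeBlock* cb );          /* vg_do_register_allocation */ |
| extern void        emit_code           ( UCodeBlock* cb );          /* VG_(emit_code) */ |
| |
| void translate_one_bb ( unsigned int orig_addr, |
|                         int do_optimise, int do_instrument, int do_cleanup ) |
| { |
|    UCodeBlock* cb = disBB(orig_addr);       /* stage 1                   */ |
|    if (do_optimise)   improve(cb);          /* stage 2: --optimise=yes   */ |
|    if (do_instrument) { |
|       cb = instrument(cb);                  /* stage 3: --instrument=yes */ |
|       if (do_cleanup) cleanup(cb);          /* stage 4: --cleanup=yes    */ |
|    } |
|    register_allocation(cb);                 /* stage 5                   */ |
|    emit_code(cb);                           /* stage 6                   */ |
| } |
| </pre> |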
| |
| <p> |
| Notice how steps 2, 3, 4 and 5 are simple UCode-to-UCode |
| transformation passes, all on straight-line blocks of UCode (type |
| <code>UCodeBlock</code>). Steps 2 and 4 are optimisation passes and |
| can be disabled for debugging purposes, with |
| <code>--optimise=no</code> and <code>--cleanup=no</code> respectively. |
| |
| <p> |
| Valgrind can also run in a no-instrumentation mode, given |
| <code>--instrument=no</code>. This is useful for debugging the JITter |
| quickly without having to deal with the complexity of the |
| instrumentation mechanism too. In this mode, steps 3 and 4 are |
| omitted. |
| |
| <p> |
| These flags combine, so that <code>--instrument=no</code> together with |
| <code>--optimise=no</code> means only steps 1, 5 and 6 are used. |
| <code>--single-step=yes</code> causes each x86 instruction to be |
| treated as a single basic block. The translations are terrible but |
| this is sometimes instructive. |
| |
| <p> |
| The <code>--stop-after=N</code> flag switches back to the real CPU |
| after <code>N</code> basic blocks. It also re-JITs the final basic |
| block executed and prints the debugging info resulting, so this |
| gives you a way to get a quick snapshot of how a basic block looks as |
| it passes through the six stages mentioned above. If you want to |
| see full information for every block translated (probably not, but |
| still ...) find, in <code>VG_(translate)</code>, the lines |
| <br><code> dis = True;</code> |
| <br><code> dis = debugging_translation;</code> |
| <br> |
| and comment out the second line. This will spew out debugging |
| junk faster than you can possibly imagine. |
| |
| |
| |
| <h3>UCode operand tags: type <code>Tag</code></h3> |
| |
| UCode is, more or less, a simple two-address RISC-like code. In |
| keeping with the x86 AT&T assembly syntax, generally speaking the |
| first operand is the source operand, and the second is the destination |
| operand, which is modified when the uinstr is notionally executed. |
| |
| <p> |
| UCode instructions have up to three operand fields, each of which has |
| a corresponding <code>Tag</code> describing it. Possible values for |
| the tag are: |
| |
| <ul> |
| <li><code>NoValue</code>: indicates that the field is not in use. |
| <p> |
| <li><code>Lit16</code>: the field contains a 16-bit literal. |
| <p> |
| <li><code>Literal</code>: the field denotes a 32-bit literal, whose |
| value is stored in the <code>lit32</code> field of the uinstr |
| itself. Since there is only one <code>lit32</code> for the whole |
| uinstr, only one operand field may contain this tag. |
| <p> |
| <li><code>SpillNo</code>: the field contains a spill slot number, in |
| the range 0 to 23 inclusive, denoting one of the spill slots |
| contained inside <code>VG_(baseBlock)</code>. Such tags only |
| exist after register allocation. |
| <p> |
| <li><code>RealReg</code>: the field contains a number in the range 0 |
| to 7 denoting an integer x86 ("real") register on the host. The |
| number is the Intel encoding for integer registers. Such tags |
| only exist after register allocation. |
| <p> |
| <li><code>ArchReg</code>: the field contains a number in the range 0 |
| to 7 denoting an integer x86 register on the simulated CPU. In |
| reality this means a reference to one of the first 8 words of |
| <code>VG_(baseBlock)</code>. Such tags can exist at any point in |
| the translation process. |
| <p> |
| <li>Last, but not least, <code>TempReg</code>. The field contains the |
| number of one of an infinite set of virtual (integer) |
| registers. <code>TempReg</code>s are used everywhere throughout |
| the translation process; you can have as many as you want. The |
| register allocator maps as many as it can into |
| <code>RealReg</code>s and turns the rest into |
| <code>SpillNo</code>s, so <code>TempReg</code>s should not exist |
| after the register allocation phase. |
| <p> |
| <code>TempReg</code>s are always 32 bits long, even if the data |
| they hold is logically shorter. In that case the upper unused |
| bits are required, and, I think, generally assumed, to be zero. |
| <code>TempReg</code>s holding V bits for quantities shorter than |
| 32 bits are expected to have ones in the unused places, since a |
| one denotes "undefined". |
| </ul> |
| |
| |
| <h3>UCode instructions: type <code>UInstr</code></h3> |
| |
| <p> |
| UCode was carefully designed to make it possible to do register |
| allocation on UCode and then translate the result into x86 code |
| without needing any extra registers ... well, that was the original |
| plan, anyway. Things have gotten a little more complicated since |
| then. In what follows, UCode instructions are referred to as uinstrs, |
| to distinguish them from x86 instructions. Uinstrs of course have |
| uopcodes which are (naturally) different from x86 opcodes. |
| |
| <p> |
| A uinstr (type <code>UInstr</code>) contains |
| various fields, not all of which are used by any one uopcode: |
| <ul> |
| <li>Three 16-bit operand fields, <code>val1</code>, <code>val2</code> |
| and <code>val3</code>. |
| <p> |
| <li>Three tag fields, <code>tag1</code>, <code>tag2</code> |
| and <code>tag3</code>. Each of these has a value of type |
| <code>Tag</code>, |
| and they describe what the <code>val1</code>, <code>val2</code> |
| and <code>val3</code> fields contain. |
| <p> |
| <li>A 32-bit literal field. |
| <p> |
| <li>Two <code>FlagSet</code>s, specifying which x86 condition codes are |
| read and written by the uinstr. |
| <p> |
| <li>An opcode byte, containing a value of type <code>Opcode</code>. |
| <p> |
| <li>A size field, indicating the data transfer size (1/2/4/8/10) in |
| cases where this makes sense, or zero otherwise. |
| <p> |
| <li>A condition-code field, which, for jumps, holds a |
| value of type <code>Condcode</code>, indicating the condition |
| which applies. The encoding is as it is in the x86 insn stream, |
| except we add a 17th value <code>CondAlways</code> to indicate |
| an unconditional transfer. |
| <p> |
| <li>Various 1-bit flags, indicating whether this insn pertains to an |
| x86 CALL or RET instruction, whether a widening is signed or not, |
| etc. |
| </ul> |
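| |
| <p> |
| Putting the two preceding sections together, here is a simplified |
| sketch of the representation; the field names and widths are |
| illustrative, not the real <code>vg_include.h</code> declarations. |
| <pre> |
| typedef enum { NoValue, Lit16, Literal, SpillNo, |
|                RealReg, ArchReg, TempReg } Tag; |
| |
| typedef struct { |
|    unsigned int   lit32;             /* 32-bit literal, when a tag is Literal  */ |
|    unsigned short val1, val2, val3;  /* the three operand fields               */ |
|    unsigned char  tag1, tag2, tag3;  /* Tag values describing val1..val3       */ |
|    unsigned char  opcode;            /* an Opcode                              */ |
|    unsigned char  size;              /* transfer size: 0/1/2/4/8/10            */ |
|    unsigned char  cond;              /* a Condcode, for jumps                  */ |
|    unsigned char  flags_r, flags_w;  /* FlagSets: %eflags read and written     */ |
|    unsigned char  misc_bits;         /* CALL/RET hints, signed-widen flag, etc */ |
| } UInstr; |
| </pre> |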
| |
| <p> |
| UOpcodes (type <code>Opcode</code>) are divided into two groups: those |
| necessary merely to express the functionality of the x86 code, and |
| extra uopcodes needed to express the instrumentation. The former |
| group contains: |
| <ul> |
| <li><code>GET</code> and <code>PUT</code>, which move values from the |
| simulated CPU's integer registers (<code>ArchReg</code>s) into |
| <code>TempReg</code>s, and back. <code>GETF</code> and |
| <code>PUTF</code> do the corresponding thing for the simulated |
| <code>%EFLAGS</code>. There are no corresponding insns for the |
| FPU register stack, since we don't explicitly simulate its |
| registers. |
| <p> |
| <li><code>LOAD</code> and <code>STORE</code>, which, in RISC-like |
| fashion, are the only uinstrs able to interact with memory. |
| <p> |
| <li><code>MOV</code> and <code>CMOV</code> allow unconditional and |
| conditional moves of values between <code>TempReg</code>s. |
| <p> |
| <li>ALU operations. Again in RISC-like fashion, these only operate on |
| <code>TempReg</code>s (before reg-alloc) or <code>RealReg</code>s |
| (after reg-alloc). These are: <code>ADD</code>, <code>ADC</code>, |
| <code>AND</code>, <code>OR</code>, <code>XOR</code>, |
| <code>SUB</code>, <code>SBB</code>, <code>SHL</code>, |
| <code>SHR</code>, <code>SAR</code>, <code>ROL</code>, |
| <code>ROR</code>, <code>RCL</code>, <code>RCR</code>, |
| <code>NOT</code>, <code>NEG</code>, <code>INC</code>, |
| <code>DEC</code>, <code>BSWAP</code>, <code>CC2VAL</code> and |
| <code>WIDEN</code>. <code>WIDEN</code> does signed or unsigned |
| value widening. <code>CC2VAL</code> is used to convert condition |
| codes into a value, zero or one. The rest are obvious. |
| <p> |
| To allow for more efficient code generation, we bend slightly the |
| restriction at the start of the previous para: for |
| <code>ADD</code>, <code>ADC</code>, <code>XOR</code>, |
| <code>SUB</code> and <code>SBB</code>, we allow the first (source) |
| operand to also be an <code>ArchReg</code>, that is, one of the |
| simulated machine's registers. Also, many of these ALU ops allow |
| the source operand to be a literal. See |
| <code>VG_(saneUInstr)</code> for the final word on the allowable |
| forms of uinstrs. |
| <p> |
| <li><code>LEA1</code> and <code>LEA2</code> are not strictly |
| necessary, but facilitate better translations.  They |
| record the fancy x86 addressing modes in a direct way, which |
| allows those amodes to be emitted back into the final |
| instruction stream more or less verbatim. |
| <p> |
| <li><code>CALLM</code> calls a machine-code helper, one of the methods |
| whose address is stored at some <code>VG_(baseBlock)</code> |
| offset. <code>PUSH</code> and <code>POP</code> move values |
| between <code>TempReg</code>s and the real (Valgrind's) stack, and |
| <code>CLEAR</code> removes values from the stack. |
| <code>CALLM_S</code> and <code>CALLM_E</code> delimit the |
| boundaries of call setups and clearings, for the benefit of the |
| instrumentation passes. Getting this right is critical, and so |
| <code>VG_(saneUCodeBlock)</code> makes various checks on the use |
| of these uopcodes. |
| <p> |
| It is important to understand that these uopcodes have nothing to |
| do with the x86 <code>call</code>, <code>return</code>, |
| <code>push</code> or <code>pop</code> instructions, and are not |
| used to implement them. Those guys turn into combinations of |
| <code>GET</code>, <code>PUT</code>, <code>LOAD</code>, |
| <code>STORE</code>, <code>ADD</code>, <code>SUB</code>, and |
| <code>JMP</code>. What these uopcodes support is calling of |
| helper functions such as <code>VG_(helper_imul_32_64)</code>, |
| which do stuff which is too difficult or tedious to emit inline. |
| <p> |
| <li><code>FPU</code>, <code>FPU_R</code> and <code>FPU_W</code>. |
| Valgrind doesn't attempt to simulate the internal state of the |
| FPU at all. Consequently it only needs to be able to distinguish |
| FPU ops which read and write memory from those that don't, and |
| for those which do, it needs to know the effective address and |
| data transfer size. This is made easier because the x86 FP |
| instruction encoding is very regular, basically consisting of |
| 16 bits for a non-memory FPU insn and 11 (IIRC) bits + an address mode |
| for a memory FPU insn. So our <code>FPU</code> uinstr carries |
| the 16 bits in its <code>val1</code> field. And |
| <code>FPU_R</code> and <code>FPU_W</code> carry 11 bits in that |
| field, together with the identity of a <code>TempReg</code> or |
| (later) <code>RealReg</code> which contains the address. |
| <p> |
| <li><code>JIFZ</code> is unique, in that it allows a control-flow |
| transfer which is not deemed to end a basic block. It causes a |
| jump to a literal (original) address if the specified argument |
| is zero. |
| <p> |
| <li>Finally, <code>INCEIP</code> advances the simulated |
| <code>%EIP</code> by the specified literal amount. This supports |
| lazy <code>%EIP</code> updating, as described below. |
| </ul> |
| |
| <p> |
| Stages 1 and 2 of the 6-stage translation process mentioned above |
| deal purely with these uopcodes, and no others. They are |
| sufficient to express pretty much all the x86 32-bit protected-mode |
| instruction set, at |
| least everything understood by a pre-MMX original Pentium (P54C). |
| |
| <p> |
| Stages 3, 4, 5 and 6 also deal with the following extra |
| "instrumentation" uopcodes. They are used to express all the |
| definedness-tracking and -checking machinery which valgrind does. In |
| later sections we show how to create checking code for each of the |
| uopcodes above. Note that these instrumentation uopcodes, although |
| some appear complicated, have been carefully chosen so that |
| efficient x86 code can be generated for them. GNU superopt v2.5 did a |
| great job helping out here. Anyways, the uopcodes are as follows: |
| |
| <ul> |
| <li><code>GETV</code> and <code>PUTV</code> are analogues to |
| <code>GET</code> and <code>PUT</code> above. They are identical |
| except that they move the V bits for the specified values back and |
| forth to <code>TempRegs</code>, rather than moving the values |
| themselves. |
| <p> |
| <li>Similarly, <code>LOADV</code> and <code>STOREV</code> read and |
| write V bits from the synthesised shadow memory that Valgrind |
| maintains. In fact they do more than that, since they also do |
| address-validity checks, and emit complaints if the read/written |
| addresses are unaddressable. |
| <p> |
| <li><code>TESTV</code>, whose parameters are a <code>TempReg</code> |
| and a size, tests the V bits in the <code>TempReg</code>, at the |
| specified operation size (0/1/2/4 byte) and emits an error if any |
| of them indicate undefinedness. This is the only uopcode capable |
| of doing such tests. |
| <p> |
| <li><code>SETV</code>, whose parameters are also <code>TempReg</code> |
| and a size, makes the V bits in the <code>TempReg</code> indicate |
| definedness, at the specified operation size. This is usually |
| used to generate the correct V bits for a literal value, which is |
| of course fully defined. |
| <p> |
| <li><code>GETVF</code> and <code>PUTVF</code> are analogues to |
| <code>GETF</code> and <code>PUTF</code>. They move the single V |
| bit used to model definedness of <code>%EFLAGS</code> between its |
| home in <code>VG_(baseBlock)</code> and the specified |
| <code>TempReg</code>. |
| <p> |
| <li><code>TAG1</code> denotes one of a family of unary operations on |
| <code>TempReg</code>s containing V bits. Similarly, |
| <code>TAG2</code> denotes one in a family of binary operations on |
| V bits. |
| </ul> |
| |
| <p> |
| These 10 uopcodes are sufficient to express Valgrind's entire |
| definedness-checking semantics. In fact most of the interesting magic |
| is done by the <code>TAG1</code> and <code>TAG2</code> |
| suboperations. |
| |
| <p> |
| First, however, I need to explain about V-vector operation sizes. |
| There are 4 sizes: 1, 2 and 4, which operate on groups of 8, 16 and 32 |
| V bits at a time, supporting the usual 1, 2 and 4 byte x86 operations. |
| However there is also the mysterious size 0, which really means a |
| single V bit. Single V bits are used in various circumstances; in |
| particular, the definedness of <code>%EFLAGS</code> is modelled with a |
| single V bit. Now might be a good time to also point out that for |
| V bits, 1 means "undefined" and 0 means "defined". Similarly, for A |
| bits, 1 means "invalid address" and 0 means "valid address". This |
| seems counterintuitive (and so it is), but testing against zero on |
| x86s saves instructions compared to testing against all 1s, because |
| many ALU operations set the Z flag for free, so to speak. |
| |
| <p> |
| With that in mind, the tag ops are: |
| |
| <ul> |
| <li><b>(UNARY) Pessimising casts</b>: <code>VgT_PCast40</code>, |
| <code>VgT_PCast20</code>, <code>VgT_PCast10</code>, |
| <code>VgT_PCast01</code>, <code>VgT_PCast02</code> and |
| <code>VgT_PCast04</code>. A "pessimising cast" takes a V-bit |
| vector at one size, and creates a new one at another size, |
| pessimised in the sense that if any of the bits in the source |
| vector indicate undefinedness, then all the bits in the result |
| indicate undefinedness. In this case the casts are all to or from |
| a single V bit, so for example <code>VgT_PCast40</code> is a |
| pessimising cast from 32 bits to 1, whereas |
| <code>VgT_PCast04</code> simply copies the single source V bit |
| into all 32 bit positions in the result.  Surprisingly, these ops |
| can all be implemented very efficiently; reference C versions of |
| some of them appear just after this list. |
| <p> |
| There are also the pessimising casts <code>VgT_PCast14</code>, |
| from 8 bits to 32, <code>VgT_PCast12</code>, from 8 bits to 16, |
| and <code>VgT_PCast11</code>, from 8 bits to 8. This last one |
| seems nonsensical, but in fact it isn't a no-op because, as |
| mentioned above, any undefined (1) bits in the source infect the |
| entire result. |
| <p> |
| <li><b>(UNARY) Propagating undefinedness upwards in a word</b>: |
| <code>VgT_Left4</code>, <code>VgT_Left2</code> and |
| <code>VgT_Left1</code>. These are used to simulate the worst-case |
| effects of carry propagation in adds and subtracts. They return a |
| V vector identical to the original, except that if the original |
| contained any undefined bits, then it and all bits above it are |
| marked as undefined too. Hence the Left bit in the names. |
| <p> |
| <li><b>(UNARY) Signed and unsigned value widening</b>: |
| <code>VgT_SWiden14</code>, <code>VgT_SWiden24</code>, |
| <code>VgT_SWiden12</code>, <code>VgT_ZWiden14</code>, |
| <code>VgT_ZWiden24</code> and <code>VgT_ZWiden12</code>. These |
| mimic the definedness effects of standard signed and unsigned |
| integer widening. Unsigned widening creates zero bits in the new |
| positions, so <code>VgT_ZWiden*</code> accordingly mark |
| those parts of their argument as defined. Signed widening copies |
| the sign bit into the new positions, so <code>VgT_SWiden*</code> |
| copies the definedness of the sign bit into the new positions. |
| Because 1 means undefined and 0 means defined, these operations |
| can (fascinatingly) be done by the same operations which they |
| mimic. Go figure. |
| <p> |
| <li><b>(BINARY) Undefined-if-either-Undefined, |
| Defined-if-either-Defined</b>: <code>VgT_UifU4</code>, |
| <code>VgT_UifU2</code>, <code>VgT_UifU1</code>, |
| <code>VgT_UifU0</code>, <code>VgT_DifD4</code>, |
| <code>VgT_DifD2</code>, <code>VgT_DifD1</code>. These do simple |
| bitwise operations on pairs of V-bit vectors, with |
| <code>UifU</code> giving undefined if either arg bit is |
| undefined, and <code>DifD</code> giving defined if either arg bit |
| is defined. Abstract interpretation junkies, if any make it this |
| far, may like to think of them as meets and joins (or is it joins |
| and meets) in the definedness lattices. |
| <p> |
| <li><b>(BINARY; one value, one V bits) Generate argument improvement |
| terms for AND and OR</b>: <code>VgT_ImproveAND4_TQ</code>, |
| <code>VgT_ImproveAND2_TQ</code>, <code>VgT_ImproveAND1_TQ</code>, |
| <code>VgT_ImproveOR4_TQ</code>, <code>VgT_ImproveOR2_TQ</code>, |
| <code>VgT_ImproveOR1_TQ</code>. These help out with AND and OR |
| operations. AND and OR have the inconvenient property that the |
| definedness of the result depends on the actual values of the |
| arguments as well as their definedness. At the bit level: |
| <br><code>1 AND undefined = undefined</code>, but |
| <br><code>0 AND undefined = 0</code>, and similarly |
| <br><code>0 OR undefined = undefined</code>, but |
| <br><code>1 OR undefined = 1</code>. |
| <br> |
| <p> |
| It turns out that gcc (quite legitimately) generates code which |
| relies on this fact, so we have to model it properly in order to |
| avoid flooding users with spurious value errors. The ultimate |
| definedness result of AND and OR is calculated using |
| <code>UifU</code> on the definedness of the arguments, but we |
| also <code>DifD</code> in some "improvement" terms which |
| take into account the above phenomena. |
| <p> |
| <code>ImproveAND</code> takes as its first argument the actual |
| value of an argument to AND (the T) and the definedness of that |
| argument (the Q), and returns a V-bit vector which is defined (0) |
| for bits which have value 0 and are defined; this, when |
| <code>DifD</code>'d into the final result, causes those bits to be |
| defined even if the corresponding bit in the other argument is undefined. |
| <p> |
| The <code>ImproveOR</code> ops do the dual thing for OR |
| arguments. Note that XOR does not have this property that one |
| argument can make the other irrelevant, so there is no need for |
| such complexity for XOR. |
| </ul> |
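| |
| <p> |
| Reference C versions of a few of these ops are shown below, derived |
| directly from the descriptions above; they are for exposition only, |
| and the generated x86 presumably uses branch-free equivalents.  A |
| single V bit is assumed here to live in bit position 0 of a 32-bit |
| word.  Remember that 1 means "undefined". |
| <pre> |
| typedef unsigned int U32; |
| |
| /* Pessimising casts between a 32-bit V vector and a single V bit. */ |
| static U32 PCast40 ( U32 v ) { return v != 0 ? 1 : 0; }           /* any undefined -> undefined */ |
| static U32 PCast04 ( U32 b ) { return b != 0 ? 0xFFFFFFFF : 0; }  /* copy the bit everywhere    */ |
| |
| /* Left: the lowest undefined bit infects itself and all bits above it. |
|    (v | -v) sets every bit from the lowest set bit upwards. */ |
| static U32 Left4 ( U32 v ) { return v | (0u - v); } |
| |
| /* Undefined-if-either-Undefined is just OR; Defined-if-either-Defined |
|    is just AND, since 0 means "defined". */ |
| static U32 UifU4 ( U32 va, U32 vb ) { return va | vb; } |
| static U32 DifD4 ( U32 va, U32 vb ) { return va & vb; } |
| |
| /* Improvement terms: a result bit is certainly defined (0) where the |
|    argument's value bit forces the result (0 for AND, 1 for OR) and |
|    that argument bit is itself defined. */ |
| static U32 ImproveAND4_TQ ( U32 value, U32 vbits ) { return value | vbits; } |
| static U32 ImproveOR4_TQ  ( U32 value, U32 vbits ) { return ~value | vbits; } |
| </pre> |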
| |
| <p> |
| That's all the tag ops. If you stare at this long enough, and then |
| run Valgrind and stare at the pre- and post-instrumented ucode, it |
| should be fairly obvious how the instrumentation machinery hangs |
| together. |
| |
| <p> |
| One point, if you do this: in order to make it easy to differentiate |
| <code>TempReg</code>s carrying values from <code>TempReg</code>s |
| carrying V bit vectors, Valgrind prints the former as (for example) |
| <code>t28</code> and the latter as <code>q28</code>; the fact that |
| they carry the same number serves to indicate their relationship. |
| This is purely for the convenience of the human reader; the register |
| allocator and code generator don't regard them as different. |
| |
| |
| <h3>Translation into UCode</h3> |
| |
| <code>VG_(disBB)</code> allocates a new <code>UCodeBlock</code> and |
| then uses <code>disInstr</code> to translate x86 instructions one at a |
| time into UCode, dumping the result in the <code>UCodeBlock</code>. |
| This goes on until a control-flow transfer instruction is encountered. |
| |
| <p> |
| Despite the large size of <code>vg_to_ucode.c</code>, this translation |
| is really very simple. Each x86 instruction is translated entirely |
| independently of its neighbours, merrily allocating new |
| <code>TempReg</code>s as it goes. The idea is to have a simple |
| translator -- in reality, no more than a macro-expander -- and the |
| resulting bad UCode translation is cleaned up by the UCode |
| optimisation phase which follows. To give you an idea of some x86 |
| instructions and their translations (this is a complete basic block, |
| as Valgrind sees it): |
| <pre> |
| 0x40435A50: incl %edx |
| |
| 0: GETL %EDX, t0 |
| 1: INCL t0 (-wOSZAP) |
| 2: PUTL t0, %EDX |
| |
| 0x40435A51: movsbl (%edx),%eax |
| |
| 3: GETL %EDX, t2 |
| 4: LDB (t2), t2 |
| 5: WIDENL_Bs t2 |
| 6: PUTL t2, %EAX |
| |
| 0x40435A54: testb $0x20, 1(%ecx,%eax,2) |
| |
| 7: GETL %EAX, t6 |
| 8: GETL %ECX, t8 |
| 9: LEA2L 1(t8,t6,2), t4 |
| 10: LDB (t4), t10 |
| 11: MOVB $0x20, t12 |
| 12: ANDB t12, t10 (-wOSZACP) |
| 13: INCEIPo $9 |
| |
| 0x40435A59: jnz-8 0x40435A50 |
| |
| 14: Jnzo $0x40435A50 (-rOSZACP) |
| 15: JMPo $0x40435A5B |
| </pre> |
| |
| <p> |
| Notice how the block always ends with an unconditional jump to the |
| next block. This is a bit unnecessary, but makes many things simpler. |
| |
| <p> |
| Most x86 instructions turn into sequences of <code>GET</code>, |
| <code>PUT</code>, <code>LEA1</code>, <code>LEA2</code>, |
| <code>LOAD</code> and <code>STORE</code>. Some complicated ones |
| however rely on calling helper bits of code in |
| <code>vg_helpers.S</code>. The ucode instructions <code>PUSH</code>, |
| <code>POP</code>, <code>CALL</code>, <code>CALLM_S</code> and |
| <code>CALLM_E</code> support this. The calling convention is somewhat |
| ad-hoc and is not the C calling convention. The helper routines must |
| save all integer registers, and the flags, that they use. Args are |
| passed on the stack underneath the return address, as usual, and if |
| results are to be returned, they are either placed in dummy arg |
| slots created by the ucode <code>PUSH</code> sequence, or just |
| overwrite the incoming args. |
| |
| <p> |
| In order that the instrumentation mechanism can handle calls to these |
| helpers, <code>VG_(saneUCodeBlock)</code> enforces the following |
| restrictions on calls to helpers (a toy model of the bracketing |
| check appears after the list): |
| |
| <ul> |
| <li>Each <code>CALL</code> uinstr must be bracketed by a preceding |
| <code>CALLM_S</code> marker (dummy uinstr) and a trailing |
| <code>CALLM_E</code> marker. These markers are used by the |
| instrumentation mechanism later to establish the boundaries of the |
| <code>PUSH</code>, <code>POP</code> and <code>CLEAR</code> |
| sequences for the call. |
| <p> |
| <li><code>PUSH</code>, <code>POP</code> and <code>CLEAR</code> |
| may only appear inside sections bracketed by <code>CALLM_S</code> |
| and <code>CALLM_E</code>, and nowhere else. |
| <p> |
| <li>In any such bracketed section, no two <code>PUSH</code> insns may |
| push the same <code>TempReg</code>. Dually, no two |
| <code>POP</code>s may pop the same <code>TempReg</code>. |
| <p> |
| <li>Finally, although this is not checked, args should be removed from |
| the stack with <code>CLEAR</code>, rather than <code>POP</code>s |
| into a <code>TempReg</code> which is not subsequently used. This |
| is because the instrumentation mechanism assumes that all values |
| <code>POP</code>ped from the stack are actually used. |
| </ul> |
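| <p> |
| The checking needed for these restrictions is essentially one scan |
| over the block. The following toy model shows the idea; the types |
| are invented, the duplicate-<code>PUSH</code>/<code>POP</code> rule |
| is not modelled, and the real checks live in |
| <code>VG_(saneUCodeBlock)</code>. |
| <pre> |
| #include &lt;assert.h&gt; |
| |
| typedef enum { uGET, uPUT, uCALLM_S, uCALL, uCALLM_E, |
|                uPUSH, uPOP, uCLEAR, uOTHER } Opcode; |
| typedef struct { Opcode op; } UInstr; |
| |
| static int brackets_ok ( UInstr* code, int n ) |
| { |
|    int i, inside = 0; |
|    for (i = 0; i &lt; n; i++) { |
|       switch (code[i].op) { |
|          case uCALLM_S:                      /* open a bracket      */ |
|             if (inside) return 0;            /* no nesting allowed  */ |
|             inside = 1; break; |
|          case uCALLM_E:                      /* close it            */ |
|             if (!inside) return 0; |
|             inside = 0; break; |
|          case uPUSH: case uPOP: case uCLEAR: /* only inside brackets */ |
|          case uCALL: |
|             if (!inside) return 0; |
|             break; |
|          default: |
|             break; |
|       } |
|    } |
|    return !inside;       /* every CALLM_S must have its CALLM_E */ |
| } |
| |
| int main ( void ) |
| { |
|    UInstr good[] = { {uGET}, {uCALLM_S}, {uPUSH}, {uCALL}, |
|                      {uCLEAR}, {uCALLM_E}, {uPUT} }; |
|    assert(brackets_ok(good, 7)); |
|    return 0; |
| } |
| </pre> |
| <p> |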
| |
| Some of the translations may appear to have redundant |
| <code>TempReg</code>-to-<code>TempReg</code> moves. This helps the |
| next phase, UCode optimisation, to generate better code. |
| |
| |
| |
| <h3>UCode optimisation</h3> |
| |
| UCode is then subjected to an improvement pass |
| (<code>vg_improve()</code>), which blurs the boundaries between the |
| translations of the original x86 instructions. It's pretty |
| straightforward. Three transformations are done: |
| |
| <ul> |
| <li>Redundant <code>GET</code> elimination. Actually, more general |
| than that -- eliminates redundant fetches of ArchRegs. In our |
| running example, uinstr 3 <code>GET</code>s <code>%EDX</code> into |
| <code>t2</code> despite the fact that, by looking at the previous |
| uinstr, it is already in <code>t0</code>. The <code>GET</code> is |
| therefore removed, and <code>t2</code> renamed to <code>t0</code>. |
| Assuming <code>t0</code> is allocated to a host register, it means |
| the simulated <code>%EDX</code> will exist in a host CPU register |
| for more than one simulated x86 instruction, which seems to me to |
| be a highly desirable property. |
| <p> |
| There is some mucking around to do with subregisters; |
| <code>%AL</code> vs <code>%AH</code>, <code>%AX</code> vs |
| <code>%EAX</code>, etc. I can't remember how it works, but in |
| general we are very conservative, and these tend to invalidate the |
| caching. A skeleton of the basic caching scheme appears after this |
| list. |
| <p> |
| <li>Redundant <code>PUT</code> elimination. This annuls |
| <code>PUT</code>s of values back to simulated CPU registers if a |
| later <code>PUT</code> would overwrite the earlier |
| <code>PUT</code> value, and there are no intervening reads of the |
| simulated register (<code>ArchReg</code>). |
| <p> |
| As before, we are paranoid when faced with subregister references. |
| Also, <code>PUT</code>s of <code>%ESP</code> are never annulled, |
| because it is vital that the instrumenter always has an up-to-date |
| <code>%ESP</code> value available, since <code>%ESP</code> changes |
| affect the addressability of the memory around the simulated stack |
| pointer. |
| <p> |
| The implication of the above paragraph is that the simulated |
| machine's registers are only lazily updated once the above two |
| optimisation phases have run, with the exception of |
| <code>%ESP</code>. <code>TempReg</code>s go dead at the end of |
| every basic block, from which it is inferable that any |
| <code>TempReg</code> caching a simulated CPU reg is flushed (back |
| into the relevant <code>VG_(baseBlock)</code> slot) at the end of |
| every basic block. The further implication is that the simulated |
| registers are only up-to-date in between basic blocks, and not |
| at arbitrary points inside basic blocks. And the consequence of |
| that is that we can only deliver signals to the client in between |
| basic blocks. None of this seems to be a problem in practice. |
| <p> |
| <li>Finally there is a simple def-use thing for condition codes. If |
| an earlier uinstr writes the condition codes, and the next uinstr |
| along that actually cares about the condition codes writes the |
| same or a larger set of them, but does not read any, the earlier |
| uinstr is marked as not writing any condition codes. This saves |
| a lot of redundant cond-code saving and restoring. |
| </ul> |
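| <p> |
| The first transformation amounts to carrying, while scanning the |
| block, a little map from each <code>ArchReg</code> to the |
| <code>TempReg</code> (if any) currently known to hold its value. |
| The skeleton below shows the idea; the types are invented, and the |
| downstream renaming of the deleted <code>GET</code>'s target, along |
| with the subregister invalidations, is left out. |
| <pre> |
| #include &lt;assert.h&gt; |
| |
| #define N_ARCHREGS 8 |
| typedef enum { uNOP, uGET, uPUT, uOTHER } Opcode; |
| typedef struct { Opcode op; int archreg; int tempreg; } UInstr; |
| |
| static void elim_redundant_GETs ( UInstr* code, int n ) |
| { |
|    int i, r, in_temp[N_ARCHREGS]; |
|    /* in_temp[r] is the TempReg currently caching ArchReg r, or -1 */ |
|    for (r = 0; r &lt; N_ARCHREGS; r++) in_temp[r] = -1; |
| |
|    for (i = 0; i &lt; n; i++) { |
|       if (code[i].op == uGET) { |
|          if (in_temp[code[i].archreg] != -1) |
|             code[i].op = uNOP;  /* value already cached: delete it */ |
|          else |
|             in_temp[code[i].archreg] = code[i].tempreg; |
|       } |
|       else if (code[i].op == uPUT) { |
|          /* this TempReg now holds the definitive register value */ |
|          in_temp[code[i].archreg] = code[i].tempreg; |
|       } |
|    } |
| } |
| |
| int main ( void ) |
| { |
|    /* GETL %EDX,t0 ; PUTL t0,%EDX ; GETL %EDX,t2 : 2nd GET goes */ |
|    UInstr b[] = { {uGET,2,0}, {uPUT,2,0}, {uGET,2,2} }; |
|    elim_redundant_GETs(b, 3); |
|    assert(b[2].op == uNOP); |
|    return 0; |
| } |
| </pre> |
| <p> |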
| |
| The effect of these transformations on our short block is rather |
| unexciting, and shown below. On longer basic blocks they can |
| dramatically improve code quality. |
| |
| <pre> |
| at 3: delete GET, rename t2 to t0 in (4 .. 6) |
| at 7: delete GET, rename t6 to t0 in (8 .. 9) |
| at 1: annul flag write OSZAP due to later OSZACP |
| |
| Improved code: |
| 0: GETL %EDX, t0 |
| 1: INCL t0 |
| 2: PUTL t0, %EDX |
| 4: LDB (t0), t0 |
| 5: WIDENL_Bs t0 |
| 6: PUTL t0, %EAX |
| 8: GETL %ECX, t8 |
| 9: LEA2L 1(t8,t0,2), t4 |
| 10: LDB (t4), t10 |
| 11: MOVB $0x20, t12 |
| 12: ANDB t12, t10 (-wOSZACP) |
| 13: INCEIPo $9 |
| 14: Jnzo $0x40435A50 (-rOSZACP) |
| 15: JMPo $0x40435A5B |
| </pre> |
| |
| <h3>UCode instrumentation</h3> |
| |
| Once you understand the meaning of the instrumentation uinstrs, |
| discussed in detail above, the instrumentation scheme is fairly |
| straightforward. Each uinstr is instrumented in isolation, and the |
| instrumentation uinstrs are placed before the original uinstr. |
| Our running example continues below. I have placed a blank line |
| after every original ucode, to make it easier to see which |
| instrumentation uinstrs correspond to which originals. |
| |
| <p> |
| As mentioned somewhere above, <code>TempReg</code>s carrying values |
| have names like <code>t28</code>, and each one has a shadow carrying |
| its V bits, with names like <code>q28</code>. This pairing aids in |
| reading instrumented ucode. |
| |
| <p> |
| One decision about all this is where to have "observation points", |
| that is, where to check that V bits are valid. I use a minimalistic |
| scheme, only checking where a failure of validity could cause the |
| original program to (seg)fault. So the use of values as memory |
| addresses causes a check, as do conditional jumps (these cause a check |
| on the definedness of the condition codes). And arguments |
| <code>PUSH</code>ed for helper calls are checked, hence the weird |
| restrictions on helper call preambles described above. |
| |
| <p> |
| Another decision is that once a value is tested, it is thereafter |
| regarded as defined, so that we do not emit multiple undefined-value |
| errors for the same undefined value. That means that |
| <code>TESTV</code> uinstrs are always followed by <code>SETV</code> |
| on the same (shadow) <code>TempReg</code>s. Most of these |
| <code>SETV</code>s are redundant and are removed by the |
| post-instrumentation cleanup phase. |
| |
| <p> |
| The instrumentation for calling helper functions deserves further |
| comment. The definedness of results from a helper is modelled using |
| just one V bit. So, in short, we do pessimising casts of the |
| definedness of all the args, down to a single bit, and then |
| <code>UifU</code> these bits together. So this single V bit will say |
| "undefined" if any part of any arg is undefined. This V bit is then |
| pessimally cast back up to the result(s) sizes, as needed. If Valgrind |
| sees that all the args are got rid of with <code>CLEAR</code> and |
| none with <code>POP</code>, so that the result of the call |
| is not actually used, it immediately examines the result V bit with a |
| <code>TESTV</code> -- <code>SETV</code> pair. If it did not do this, |
| there would be no observation point to detect that some of the |
| args to the helper were undefined. |
| are indeed used, we don't do this, since the result usage will |
| presumably cause the result definedness to be checked at some suitable |
| future point. |
| |
| <p> |
| In general Valgrind tries to track definedness on a bit-for-bit basis, |
| but as the above para shows, for calls to helpers we throw in the |
| towel and approximate down to a single bit. This is because it's too |
| complex and difficult to track bit-level definedness through complex |
| ops such as integer multiply and divide, and in any case there are no |
| reasonable code fragments which attempt to (eg) multiply two |
| partially-defined values and end up with something meaningful, so |
| there seems little point in modelling multiplies, divides, etc, at |
| that level of detail. |
| |
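| <p> |
| To illustrate the pessimising casts involved, here is a small C |
| model of squashing the args' definedness down to one bit and |
| expanding the result again. The names are invented; the real thing |
| is expressed with the <code>PCast</code> tag ops described earlier. |
| <pre> |
| #include &lt;stdio.h&gt; |
| |
| typedef unsigned int UInt;  /* V bits: 0 means defined, 1 undefined */ |
| |
| /* Pessimise a 32-bit V-bit vector down to a single bit: undefined |
|    if any bit of the argument is undefined. */ |
| static UInt pcast_down ( UInt vbits ) { return vbits == 0 ? 0 : 1; } |
| |
| /* Pessimise a single V bit back up to 32 bits: the result is |
|    either all-defined or all-undefined. */ |
| static UInt pcast_up ( UInt vbit ) { return vbit == 0 ? 0x00000000 |
|                                                       : 0xFFFFFFFF; } |
| |
| int main ( void ) |
| { |
|    /* model a 2-arg helper: UifU the pessimised args, then expand */ |
|    UInt varg1 = 0x00000000;              /* fully defined          */ |
|    UInt varg2 = 0x00010000;              /* one undefined bit      */ |
|    UInt vres  = pcast_up( pcast_down(varg1) | pcast_down(varg2) ); |
|    printf("%08x\n", vres);               /* ffffffff: undefined    */ |
|    return 0; |
| } |
| </pre> |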
| <p> |
| Integer loads and stores are instrumented with firstly a test of the |
| definedness of the address, followed by a <code>LOADV</code> or |
| <code>STOREV</code> respectively. These turn into calls to |
| (for example) <code>VG_(helperc_LOADV4)</code>. These helpers do two |
| things: they perform an address-valid check, and they load or store V |
| bits from/to the relevant address in the (simulated V-bit) memory. |
| |
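| <p> |
| As a rough model of what such a load helper has to achieve, |
| consider the following sketch. The data-structure details are |
| invented, and the real maps in <code>vg_memory.c</code> pack the A |
| and V bits more tightly; only the two-level primary/secondary |
| lookup and the two jobs (addressability check, V-bit fetch) follow |
| the description above. |
| <pre> |
| #include &lt;stdio.h&gt; |
| #include &lt;string.h&gt; |
| |
| typedef unsigned char UChar; |
| typedef unsigned int  Addr; |
| |
| typedef struct { |
|    UChar abyte[65536];   /* one A flag per byte: 0 = addressable   */ |
|    UChar vbyte[65536];   /* 8 V bits per byte:   0 = fully defined */ |
| } SecMap; |
| |
| static SecMap* primary_map[65536];   /* indexed by top 16 bits */ |
| |
| static SecMap* get_secmap ( Addr a ) |
| { |
|    int i = a / 65536;               /* top 16 bits of the address */ |
|    if (primary_map[i] == NULL) { |
|       /* "not mapped yet": everything unaddressable and undefined */ |
|       static SecMap dist; |
|       memset(dist.abyte, 1,    sizeof dist.abyte); |
|       memset(dist.vbyte, 0xFF, sizeof dist.vbyte); |
|       primary_map[i] = &amp;dist; |
|    } |
|    return primary_map[i]; |
| } |
| |
| /* Roughly what a helperc_LOADV-style routine must do for a 1-byte |
|    load: complain if the byte is unaddressable, return its V bits. */ |
| static UChar loadv1 ( Addr a ) |
| { |
|    SecMap* sm  = get_secmap(a); |
|    int     off = a % 65536;         /* low 16 bits of the address */ |
|    if (sm-&gt;abyte[off] != 0) |
|       printf("Invalid read of size 1 at 0x%x\n", a); |
|    return sm-&gt;vbyte[off]; |
| } |
| |
| int main ( void ) |
| { |
|    /* nothing is mapped yet, so this both complains and returns |
|       all-undefined (0xFF) V bits */ |
|    printf("V bits: %02x\n", loadv1(0x40435A54)); |
|    return 0; |
| } |
| </pre> |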
| <p> |
| FPU loads and stores are different. As above, the definedness of the |
| address is first tested. However, the helper routine for FPU loads |
| (<code>VGM_(fpu_read_check)</code>) emits an error if either the |
| address is invalid or the referenced area contains undefined values. |
| It has to do this because we do not simulate the FPU at all, and so |
| cannot track the definedness of values loaded into it from memory; we |
| therefore have to check them as soon as they are loaded into the FPU, |
| ie, at this point. We notionally assume that everything in the FPU is |
| defined. |
| |
| <p> |
| It follows therefore that FPU writes first check the definedness of |
| the address, then the validity of the address, and finally mark the |
| written bytes as well-defined. |
| |
| <p> |
| If anyone is inspired to extend Valgrind to MMX/SSE insns, I suggest |
| you use the same trick. It works provided that the FPU/MMX unit is |
| not used merely as a conduit to copy partially undefined data from |
| one place in memory to another. Unfortunately the integer CPU is used |
| like that (when copying C structs with holes, for example) and this is |
| the cause of much of the elaborateness of the instrumentation here |
| described. |
| |
| <p> |
| <code>vg_instrument()</code> in <code>vg_translate.c</code> actually |
| does the instrumentation. There are comments explaining how each |
| uinstr is handled, so we do not repeat that here. As explained |
| already, it is bit-accurate, except for calls to helper functions. |
| Unfortunately the x86 insns <code>bt/bts/btc/btr</code> are done by |
| helper fns, so bit-level accuracy is lost there. This should be fixed |
| by doing them inline; it will probably require adding a couple of new |
| uinstrs. Also, left and right rotates through the carry flag (x86 |
| <code>rcl</code> and <code>rcr</code>) are approximated via a single |
| V bit; so far this has not caused anyone to complain. The |
| non-carry rotates, <code>rol</code> and <code>ror</code>, are much |
| more common and are done exactly. Re-visiting the instrumentation for |
| AND and OR, it seems rather verbose, and I wonder if it could be done |
| more concisely now. |
| |
| <p> |
| The lowercase <code>o</code> on many of the uopcodes in the running |
| example indicates that the size field is zero, usually meaning a |
| single-bit operation. |
| |
| <p> |
| Anyroads, the post-instrumented version of our running example looks |
| like this: |
| |
| <pre> |
| Instrumented code: |
| 0: GETVL %EDX, q0 |
| 1: GETL %EDX, t0 |
| |
| 2: TAG1o q0 = Left4 ( q0 ) |
| 3: INCL t0 |
| |
| 4: PUTVL q0, %EDX |
| 5: PUTL t0, %EDX |
| |
| 6: TESTVL q0 |
| 7: SETVL q0 |
| 8: LOADVB (t0), q0 |
| 9: LDB (t0), t0 |
| |
| 10: TAG1o q0 = SWiden14 ( q0 ) |
| 11: WIDENL_Bs t0 |
| |
| 12: PUTVL q0, %EAX |
| 13: PUTL t0, %EAX |
| |
| 14: GETVL %ECX, q8 |
| 15: GETL %ECX, t8 |
| |
| 16: MOVL q0, q4 |
| 17: SHLL $0x1, q4 |
| 18: TAG2o q4 = UifU4 ( q8, q4 ) |
| 19: TAG1o q4 = Left4 ( q4 ) |
| 20: LEA2L 1(t8,t0,2), t4 |
| |
| 21: TESTVL q4 |
| 22: SETVL q4 |
| 23: LOADVB (t4), q10 |
| 24: LDB (t4), t10 |
| |
| 25: SETVB q12 |
| 26: MOVB $0x20, t12 |
| |
| 27: MOVL q10, q14 |
| 28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 ) |
| 29: TAG2o q10 = UifU1 ( q12, q10 ) |
| 30: TAG2o q10 = DifD1 ( q14, q10 ) |
| 31: MOVL q12, q14 |
| 32: TAG2o q14 = ImproveAND1_TQ ( t12, q14 ) |
| 33: TAG2o q10 = DifD1 ( q14, q10 ) |
| 34: MOVL q10, q16 |
| 35: TAG1o q16 = PCast10 ( q16 ) |
| 36: PUTVFo q16 |
| 37: ANDB t12, t10 (-wOSZACP) |
| |
| 38: INCEIPo $9 |
| |
| 39: GETVFo q18 |
| 40: TESTVo q18 |
| 41: SETVo q18 |
| 42: Jnzo $0x40435A50 (-rOSZACP) |
| |
| 43: JMPo $0x40435A5B |
| </pre> |
| |
| |
| <h3>UCode post-instrumentation cleanup</h3> |
| |
| <p> |
| This pass, coordinated by <code>vg_cleanup()</code>, removes redundant |
| definedness computation created by the simplistic instrumentation |
| pass. It consists of two passes, |
| <code>vg_propagate_definedness()</code> followed by |
| <code>vg_delete_redundant_SETVs</code>. |
| |
| <p> |
| <code>vg_propagate_definedness()</code> is a simple |
| constant-propagation and constant-folding pass. It tries to determine |
| which <code>TempReg</code>s containing V bits will always indicate |
| "fully defined", and it propagates this information as far as it can, |
| and folds out as many operations as possible. For example, the |
| instrumentation for an ADD of a literal to a variable quantity will be |
| reduced down so that the definedness of the result is simply the |
| definedness of the variable quantity, since the literal is by |
| definition fully defined. |
| |
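| <p> |
| One flavour of rule such a pass can use, sketched in C with |
| invented names: a <code>UifU</code> whose argument is known to be |
| fully defined folds to its other argument. |
| <pre> |
| #include &lt;assert.h&gt; |
| |
| typedef unsigned int UInt; |
| #define FULLY_DEFINED 0          /* all V bits zero */ |
| |
| static UInt fold_UifU4 ( UInt vx, UInt vy ) |
| { |
|    if (vx == FULLY_DEFINED) return vy;   /* eg a literal operand */ |
|    if (vy == FULLY_DEFINED) return vx; |
|    return vx | vy;                       /* general case */ |
| } |
| |
| int main ( void ) |
| { |
|    /* ADD of a literal to a variable: the combined definedness is |
|       just the definedness of the variable quantity */ |
|    assert(fold_UifU4(FULLY_DEFINED, 0x0000FF00) == 0x0000FF00); |
|    return 0; |
| } |
| </pre> |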
| <p> |
| <code>vg_delete_redundant_SETVs</code> removes <code>SETV</code>s on |
| shadow <code>TempReg</code>s for which the next action is a write. |
| I don't think there's anything else worth saying about this; it is |
| simple. Read the sources for details. |
| |
| <p> |
| So the cleaned-up running example looks like this. As above, I have |
| inserted line breaks after every original (non-instrumentation) uinstr |
| to aid readability. As with straightforward ucode optimisation, the |
| results in this block are undramatic because it is so short; longer |
| blocks benefit more because they have more redundancy which gets |
| eliminated. |
| |
| |
| <pre> |
| at 29: delete UifU1 due to defd arg1 |
| at 32: change ImproveAND1_TQ to MOV due to defd arg2 |
| at 41: delete SETV |
| at 31: delete MOV |
| at 25: delete SETV |
| at 22: delete SETV |
| at 7: delete SETV |
| |
| 0: GETVL %EDX, q0 |
| 1: GETL %EDX, t0 |
| |
| 2: TAG1o q0 = Left4 ( q0 ) |
| 3: INCL t0 |
| |
| 4: PUTVL q0, %EDX |
| 5: PUTL t0, %EDX |
| |
| 6: TESTVL q0 |
| 8: LOADVB (t0), q0 |
| 9: LDB (t0), t0 |
| |
| 10: TAG1o q0 = SWiden14 ( q0 ) |
| 11: WIDENL_Bs t0 |
| |
| 12: PUTVL q0, %EAX |
| 13: PUTL t0, %EAX |
| |
| 14: GETVL %ECX, q8 |
| 15: GETL %ECX, t8 |
| |
| 16: MOVL q0, q4 |
| 17: SHLL $0x1, q4 |
| 18: TAG2o q4 = UifU4 ( q8, q4 ) |
| 19: TAG1o q4 = Left4 ( q4 ) |
| 20: LEA2L 1(t8,t0,2), t4 |
| |
| 21: TESTVL q4 |
| 23: LOADVB (t4), q10 |
| 24: LDB (t4), t10 |
| |
| 26: MOVB $0x20, t12 |
| |
| 27: MOVL q10, q14 |
| 28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 ) |
| 30: TAG2o q10 = DifD1 ( q14, q10 ) |
| 32: MOVL t12, q14 |
| 33: TAG2o q10 = DifD1 ( q14, q10 ) |
| 34: MOVL q10, q16 |
| 35: TAG1o q16 = PCast10 ( q16 ) |
| 36: PUTVFo q16 |
| 37: ANDB t12, t10 (-wOSZACP) |
| |
| 38: INCEIPo $9 |
| 39: GETVFo q18 |
| 40: TESTVo q18 |
| 42: Jnzo $0x40435A50 (-rOSZACP) |
| |
| 43: JMPo $0x40435A5B |
| </pre> |
| |
| |
| <h3>Translation from UCode</h3> |
| |
| This is all very simple, even though <code>vg_from_ucode.c</code> |
| is a big file. Position-independent x86 code is generated into |
| a dynamically allocated array <code>emitted_code</code>; this is |
| doubled in size when it overflows. Eventually the array is handed |
| back to the caller of <code>VG_(translate)</code>, who must copy |
| the result into TC and TT, and free the array. |
| |
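| <p> |
| The buffer management is nothing more exotic than the usual |
| doubling array; a sketch is below. The names are invented, and the |
| real emitter of course does not use the C library's |
| <code>malloc</code>. |
| <pre> |
| #include &lt;stdlib.h&gt; |
| |
| static unsigned char* emitted_code = NULL; |
| static int emitted_code_used = 0; |
| static int emitted_code_size = 0; |
| |
| static void emitB ( unsigned char b )     /* emit one byte of x86 */ |
| { |
|    if (emitted_code_used == emitted_code_size) { |
|       emitted_code_size = emitted_code_size == 0 |
|                              ? 500 : 2 * emitted_code_size; |
|       emitted_code = realloc(emitted_code, emitted_code_size); |
|    } |
|    emitted_code[emitted_code_used++] = b; |
| } |
| |
| int main ( void ) |
| { |
|    emitB(0x90);      /* an x86 `nop', just to exercise the buffer */ |
|    free(emitted_code); |
|    return 0; |
| } |
| </pre> |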
| <p> |
| This file is structured into four layers of abstraction, which, |
| thankfully, are glued back together with extensive |
| <code>__inline__</code> directives. From the bottom upwards: |
| |
| <ul> |
| <li>Address-mode emitters, <code>emit_amode_regmem_reg</code> et al. |
| <p> |
| <li>Emitters for specific x86 instructions. There are quite a lot of |
| these, with names such as <code>emit_movv_offregmem_reg</code>. |
| The <code>v</code> suffix is Intel parlance for a 16/32 bit insn; |
| there are also <code>b</code> suffixes for 8 bit insns. |
| <p> |
| <li>The next level up are the <code>synth_*</code> functions, which |
| synthesise a (possibly long) sequence of raw x86 instructions to do some |
| simple task. Some of these are quite complex because they have to |
| work around Intel's silly restrictions on subregister naming. See |
| <code>synth_nonshiftop_reg_reg</code> for example. |
| <p> |
| <li>Finally, at the top of the heap, we have |
| <code>emitUInstr()</code>, |
| which emits code for a single uinstr. |
| </ul> |
| |
| <p> |
| Some comments: |
| <ul> |
| <li>The hack for FPU instructions becomes apparent here. To do a |
| <code>FPU</code> ucode instruction, we load the simulated FPU's |
| state from <code>VG_(baseBlock)</code> into the real FPU |
| using an x86 <code>frstor</code> insn, do the ucode |
| <code>FPU</code> insn on the real CPU, and write the updated FPU |
| state back into <code>VG_(baseBlock)</code> using an |
| <code>fnsave</code> instruction. This is pretty brutal, but it is |
| simple and it works, and even seems tolerably efficient. There is |
| no attempt to cache the simulated FPU state in the real FPU over |
| multiple back-to-back ucode FPU instructions; a sketch of the |
| save/do/restore idea appears just after this list. |
| <p> |
| <code>FPU_R</code> and <code>FPU_W</code> are also done this way, |
| with the minor complication that we need to patch in some |
| addressing mode bits so the resulting insn knows the effective |
| address to use. This is easy because of the regularity of the x86 |
| FPU instruction encodings. |
| <p> |
| <li>An analogous trick is done with ucode insns which claim, in their |
| <code>flags_r</code> and <code>flags_w</code> fields, that they |
| read or write the simulated <code>%EFLAGS</code>. For such cases |
| we first copy the simulated <code>%EFLAGS</code> into the real |
| <code>%eflags</code>, then do the insn, then, if the insn says it |
| writes the flags, copy back to <code>%EFLAGS</code>. This is a |
| bit expensive, which is why the ucode optimisation pass goes to |
| some effort to remove redundant flag-update annotations. |
| </ul> |
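| <p> |
| Here is the promised sketch of the save/do/restore idea, written |
| as C with gcc inline assembly so it can be tried standalone. It is |
| a model of the concept only: in the real translations the |
| <code>frstor</code> and <code>fnsave</code> are emitted directly |
| into the generated code, and the save area is the simulated FPU |
| state in <code>VG_(baseBlock)</code>; here it is just a static |
| 108-byte <code>fnsave</code> image. |
| <pre> |
| static unsigned char simulated_fpu_state[108];   /* fnsave image */ |
| |
| static void do_one_fpu_insn_for_real ( void ) |
| { |
|    /* load the simulated FPU state into the real FPU ... */ |
|    __asm__ __volatile__ ( "frstor %0" |
|                           : : "m" (simulated_fpu_state[0]) ); |
| |
|    /* ... do the ucode FPU insn on the real CPU; as a stand-in, |
|       just push the constant 1.0 onto the FPU stack ... */ |
|    __asm__ __volatile__ ( "fld1" ); |
| |
|    /* ... and write the updated state back again. */ |
|    __asm__ __volatile__ ( "fnsave %0" |
|                           : "=m" (simulated_fpu_state[0]) |
|                           : : "memory" ); |
| } |
| |
| int main ( void ) |
| { |
|    /* start from a sane saved image */ |
|    __asm__ __volatile__ ( "fninit\n\tfnsave %0" |
|                           : "=m" (simulated_fpu_state[0]) |
|                           : : "memory" ); |
|    do_one_fpu_insn_for_real(); |
|    return 0; |
| } |
| </pre> |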
| |
| <p> |
| And so ... that's the end of the documentation for the instrumenting |
| translator! It's really not that complex, because it's composed as a |
| sequence of simple(ish) self-contained transformations on |
| straight-line blocks of code. |
| |
| |
| <h3>Top-level dispatch loop</h3> |
| |
| Urk. In <code>VG_(toploop)</code>. This is basically boring and |
| unsurprising, not to mention fiddly and fragile. It needs to be |
| cleaned up. |
| |
| <p> |
| Perhaps the only surprise is that the whole thing is run |
| on top of a <code>setjmp</code>-installed exception handler, because, |
| supposing a translation got a segfault, we have to bail out of the |
| Valgrind-supplied exception handler <code>VG_(oursignalhandler)</code> |
| and immediately start running the client's segfault handler, if it has |
| one. In particular we can't finish the current basic block and then |
| deliver the signal at some convenient future point, because signals |
| like SIGILL, SIGSEGV and SIGBUS mean that the faulting insn should not |
| simply be re-tried. (I'm sure there is a clearer way to explain this). |
| |
| |
| <h3>Exceptions, creating new translations</h3> |
| <h3>Self-modifying code</h3> |
| |
| <h3>Lazy updates of the simulated program counter</h3> |
| |
| Simulated <code>%EIP</code> is not updated after every simulated x86 |
| insn as this was regarded as too expensive. Instead ucode |
| <code>INCEIP</code> insns move it along as and when necessary. |
| Currently we don't allow it to fall more than 4 bytes behind reality |
| (see <code>VG_(disBB)</code> for the way this works). |
| <p> |
| Note that <code>%EIP</code> is always brought up to date by the inner |
| dispatch loop in <code>VG_(dispatch)</code>, so that if the client |
| takes a fault we know at least which basic block this happened in. |
| |
| |
| <h3>The translation cache and translation table</h3> |
| |
| <h3>Signals</h3> |
| |
| Horrible, horrible. <code>vg_signals.c</code>. |
| Basically, since we have to intercept all system |
| calls anyway, we can see when the client tries to install a signal |
| handler. If it does so, we make a note of what the client asked to |
| happen, and ask the kernel to route the signal to our own signal |
| handler, <code>VG_(oursignalhandler)</code>. This simply notes the |
| delivery of signals, and returns. |
| |
| <p> |
| Every 1000 basic blocks, we see if more signals have arrived. If so, |
| <code>VG_(deliver_signals)</code> builds signal delivery frames on the |
| client's stack, and allows their handlers to be run. Valgrind places |
| in these signal delivery frames a bogus return address, |
| <code>VG_(signalreturn_bogusRA)</code>, and checks all jumps to see |
| if any jump to it. Such a jump is a sign that a signal handler is |
| returning, in which case Valgrind removes the relevant signal frame from |
| the client's stack, restores from the signal frame the simulated |
| state as it was before the signal was delivered, and allows the client |
| to run onwards. We have to do it this way because some signal handlers |
| never return; they just <code>longjmp()</code>, which nukes the signal |
| delivery frame. |
| |
| <p> |
| The Linux kernel has a different but equally horrible hack for |
| detecting signal handler returns. Discovering it is left as an |
| exercise for the reader. |
| |
| |
| |
| <h3>Errors, error contexts, error reporting, suppressions</h3> |
| <h3>Client malloc/free</h3> |
| <h3>Low-level memory management</h3> |
| <h3>A and V bitmaps</h3> |
| <h3>Symbol table management</h3> |
| <h3>Dealing with system calls</h3> |
| <h3>Namespace management</h3> |
| <h3>GDB attaching</h3> |
| <h3>Non-dependence on glibc or anything else</h3> |
| <h3>The leak detector</h3> |
| <h3>Performance problems</h3> |
| <h3>Continuous sanity checking</h3> |
| <h3>Tracing, or not tracing, child processes</h3> |
| <h3>Assembly glue for syscalls</h3> |
| |
| |
| <hr width="100%"> |
| |
| <h2>Extensions</h2> |
| |
| Some comments about Stuff To Do. |
| |
| <h3>Bugs</h3> |
| |
| Stephan Kulow and Marc Mutz report problems with kmail in KDE 3 CVS |
| (RC2 ish) when run on Valgrind. Stephan has it deadlocking; Marc has |
| it looping at startup. I can't repro either behaviour. Needs |
| repro-ing and fixing. |
| |
| |
| <h3>Threads</h3> |
| |
| Doing a good job of thread support strikes me as almost a |
| research-level problem. The central issues are how to do fast cheap |
| locking of the <code>VG_(primary_map)</code> structure, whether or not |
| accesses to the individual secondary maps need locking, what |
| race-condition issues result, and whether the already-nasty mess that |
| is the signal simulator needs further hackery. |
| |
| <p> |
| I realise that threads are the most-frequently-requested feature, and |
| I am thinking about it all. If you have guru-level understanding of |
| fast mutual exclusion mechanisms and race conditions, I would be |
| interested in hearing from you. |
| |
| |
| <h3>Verification suite</h3> |
| |
| Directory <code>tests/</code> contains various ad-hoc tests for |
| Valgrind. However, there is no systematic verification or regression |
| suite that, for example, exercises all the stuff in |
| <code>vg_memory.c</code>, to ensure that illegal memory accesses and |
| undefined value uses are detected as they should be. It would be good |
| to have such a suite. |
| |
| |
| <h3>Porting to other platforms</h3> |
| |
| It would be great if Valgrind were ported to FreeBSD and x86 NetBSD, |
| and to x86 OpenBSD, if that's possible (doesn't OpenBSD use a.out-style |
| executables, not ELF?) |
| |
| <p> |
| The main difficulties, for an x86-ELF platform, seem to be: |
| |
| <ul> |
| <li>You'd need to rewrite the <code>/proc/self/maps</code> parser |
| (<code>vg_procselfmaps.c</code>). |
| Easy. |
| <p> |
| <li>You'd need to rewrite <code>vg_syscall_mem.c</code>, or, more |
| specifically, provide one for your OS. This is tedious, but you |
| can implement syscalls on demand, and the Linux kernel interface |
| is, for the most part, going to look very similar to the *BSD |
| interfaces, so it's really a copy-paste-and-modify-on-demand job. |
| As part of this, you'd need to supply a new |
| <code>vg_kerneliface.h</code> file. |
| <p> |
| <li>You'd also need to change the syscall wrappers for Valgrind's |
| internal use, in <code>vg_mylibc.c</code>. |
| </ul> |
| |
| All in all, I think a port to x86-ELF *BSDs is not really very |
| difficult, and in some ways I would like to see it happen, because |
| that would force a clearer factoring of Valgrind into platform |
| dependent and independent pieces. Not to mention, *BSD folks also |
| deserve to use Valgrind just as much as the Linux crew do. |
| |
| |
| <p> |
| <hr width="100%"> |
| |
| <h2>Easy stuff which ought to be done</h2> |
| |
| <h3>MMX instructions</h3> |
| |
| MMX insns should be supported, using the same trick as for FPU insns. |
| Provided the MMX registers are not used to copy uninitialised junk |
| from one place in memory to another, we don't have to actually |
| simulate the internal MMX unit state, so the FPU hack applies. This |
| should be fairly easy. |
| |
| |
| |
| <h3>Fix stabs-info reader</h3> |
| |
| The machinery in <code>vg_symtab2.c</code> which reads "stabs" style |
| debugging info is pretty weak. It usually correctly translates |
| simulated program counter values into line numbers and procedure |
| names, but the file name is often completely wrong. I think the |
| logic used to parse "stabs" entries is weak. It should be fixed. |
| The simplest solution, IMO, is to copy either the logic or simply the |
| code out of GNU binutils which does this; since GDB can clearly get it |
| right, binutils (or GDB?) must have code to do this somewhere. |
| |
| |
| |
| |
| |
| <h3>BT/BTC/BTS/BTR</h3> |
| |
| These are x86 instructions which test, complement, set, or reset, a |
| single bit in a word. At the moment they are both incorrectly |
| implemented and incorrectly instrumented. |
| |
| <p> |
| The incorrect instrumentation is due to use of helper functions. This |
| means we lose bit-level definedness tracking, which could wind up |
| giving spurious uninitialised-value use errors. The Right Thing to do |
| is to invent a couple of new UOpcodes, I think <code>GET_BIT</code> |
| and <code>SET_BIT</code>, which can be used to implement all 4 x86 |
| insns, get rid of the helpers, and give bit-accurate instrumentation |
| rules for the two new UOpcodes. |
| |
| <p> |
| I realised the other day that they are mis-implemented too. The x86 |
| insns take a bit-index and a register or memory location to access. |
| For registers the bit index clearly can only be in the range zero to |
| register-width minus 1, and I assumed the same applied to memory |
| locations too. But evidently not; for memory locations the index can |
| be arbitrary, and the processor will index arbitrarily into memory as |
| a result. This too should be fixed. Sigh. Presumably indexing |
| outside the immediate word is not actually used by any programs yet |
| tested on Valgrind, for otherwise they would simply not |
| work at all. If you plan to hack on this, first check the Intel docs |
| to make sure my understanding is really correct. |
| |
| |
| |
| <h3>Using PREFETCH instructions</h3> |
| |
| Here's a small but potentially interesting project for performance |
| junkies. Experiments with Valgrind's code generator and optimiser(s) |
| suggest that reducing the number of instructions executed in the |
| translations and mem-check helpers gives disappointingly small |
| performance improvements. Perhaps this is because performance of |
| Valgrindified code is limited by cache misses. After all, each read |
| in the original program now gives rise to at least three reads, one |
| of the <code>VG_(primary_map)</code>, one of the resulting |
| secondary map, and the original read itself. Not to mention, the instrumented |
| translations are 13 to 14 times larger than the originals. All in all |
| one would expect the memory system to be hammered to hell and then |
| some. |
| |
| <p> |
| So here's an idea. An x86 insn involving a read from memory, after |
| instrumentation, will turn into ucode of the following form: |
| <pre> |
| ... calculate effective addr, into ta and qa ... |
| TESTVL qa -- is the addr defined? |
| LOADV (ta), qloaded -- fetch V bits for the addr |
| LOAD (ta), tloaded -- do the original load |
| </pre> |
| At the point where the <code>LOADV</code> is done, we know the actual |
| address (<code>ta</code>) from which the real <code>LOAD</code> will |
| be done. We also know that the <code>LOADV</code> will take around |
| 20 x86 insns to do. So it seems plausible that doing a prefetch of |
| <code>ta</code> just before the <code>LOADV</code> might just avoid a |
| miss at the <code>LOAD</code> point, and that might be a significant |
| performance win. |
| |
| <p> |
| Prefetch insns are notoriously temperamental, more often than not |
| making things worse rather than better, so this would require |
| considerable fiddling around. It's complicated because Intels and |
| AMDs have different prefetch insns with different semantics, so that |
| too needs to be taken into account. As a general rule, even placing |
| the prefetches before the <code>LOADV</code> insn is too near the |
| <code>LOAD</code>; the ideal distance is apparently circa 200 CPU |
| cycles. So it might be worth having another analysis/transformation |
| pass which pushes prefetches as far back as possible, hopefully |
| immediately after the effective address becomes available. |
| |
| <p> |
| Doing too many prefetches is also bad because they soak up bus |
| bandwidth / cpu resources, so some cleverness in deciding which loads |
| to prefetch and which to not might be helpful. One can imagine not |
| prefetching client-stack-relative (<code>%EBP</code> or |
| <code>%ESP</code>) accesses, since the stack in general tends to show |
| good locality anyway. |
| |
| <p> |
| There's quite a lot of experimentation to do here, but I think it |
| might make an interesting week's work for someone. |
| |
| <p> |
| As of 15-ish March 2002, I've started to experiment with this, using |
| the AMD <code>prefetch/prefetchw</code> insns. |
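| <p> |
| For anyone who wants to get a feel for prefetch behaviour before |
| touching the code generator, a throwaway C experiment like the one |
| below is a cheap starting point. It has nothing to do with the |
| ucode-level scheme described above; it just lets you play with the |
| prefetch distance on a simple loop. The insn used is the SSE |
| <code>prefetchnta</code>; the AMD insns are drop-in alternatives, |
| and the CPU must support whichever one is chosen. |
| <pre> |
| #include &lt;stdio.h&gt; |
| |
| #define PREFETCH_DIST 64    /* elements ahead; tune by experiment */ |
| |
| static unsigned long sum_with_prefetch ( const unsigned long* a, |
|                                          int n ) |
| { |
|    unsigned long s = 0; |
|    int i; |
|    for (i = 0; i &lt; n; i++) { |
|       if (i + PREFETCH_DIST &lt; n) |
|          __asm__ __volatile__ ( "prefetchnta %0" |
|                                 : : "m" (a[i + PREFETCH_DIST]) ); |
|       s += a[i]; |
|    } |
|    return s; |
| } |
| |
| int main ( void ) |
| { |
|    static unsigned long arr[100000]; |
|    int i; |
|    for (i = 0; i &lt; 100000; i++) arr[i] = i; |
|    printf("%lu\n", sum_with_prefetch(arr, 100000)); |
|    return 0; |
| } |
| </pre> |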
| |
| |
| |
| <h3>User-defined permission ranges</h3> |
| |
| This is quite a large project -- perhaps a month's hacking for a |
| capable hacker to do a good job -- but it's potentially very |
| interesting. The outcome would be that Valgrind could detect a |
| whole class of bugs which it currently cannot. |
| |
| <p> |
| The presentation falls into two pieces. |
| |
| <p> |
| <b>Part 1: user-defined address-range permission setting</b> |
| <p> |
| |
| Valgrind intercepts the client's <code>malloc</code>, |
| <code>free</code>, etc calls, watches system calls, and watches the |
| stack pointer move. These are currently the only ways it knows |
| which addresses are valid and which are not. Sometimes the client program |
| knows extra information about its memory areas. For example, the |
| client could at some point know that all elements of an array are |
| out-of-date. We would like to be able to convey to Valgrind this |
| information that the array is now addressable-but-uninitialised, so |
| that Valgrind can then warn if elements are used before they get new |
| values. |
| |
| <p> |
| What I would like are some macros like this: |
| <pre> |
| VALGRIND_MAKE_NOACCESS(addr, len) |
| VALGRIND_MAKE_WRITABLE(addr, len) |
| VALGRIND_MAKE_READABLE(addr, len) |
| </pre> |
| and also, to check that memory is addressible/initialised, |
| <pre> |
| VALGRIND_CHECK_ADDRESSIBLE(addr, len) |
| VALGRIND_CHECK_INITIALISED(addr, len) |
| </pre> |
| |
| <p> |
| I then include in my sources a header defining these macros, rebuild |
| my app, run under Valgrind, and get user-defined checks. |
| |
| <p> |
| Now here's a neat trick. It's a nuisance to have to re-link the app |
| with some new library which implements the above macros. So the idea |
| is to define the macros so that the resulting executable is still |
| completely stand-alone, and can be run without Valgrind, in which case |
| the macros do nothing, but when run on Valgrind, the Right Thing |
| happens. How to do this? The idea is for these macros to turn into a |
| piece of inline assembly code, which (1) has no effect when run on the |
| real CPU, (2) is easily spotted by Valgrind's JITter, and (3) no sane |
| person would ever write, which is important for avoiding false matches |
| in (2). So here's a suggestion: |
| <pre> |
| VALGRIND_MAKE_NOACCESS(addr, len) |
| </pre> |
| becomes (roughly speaking) |
| <pre> |
| movl addr, %eax |
| movl len, %ebx |
| movl $1, %ecx -- 1 describes the action; MAKE_WRITABLE might be |
| -- 2, etc |
| rorl $13, %ecx |
| rorl $19, %ecx |
| rorl $11, %eax |
| rorl $21, %eax |
| </pre> |
| The rotate sequences have no effect, and it's unlikely they would |
| appear for any other reason, but they define a unique byte-sequence |
| which the JITter can easily spot. Using the operand constraints |
| section at the end of a gcc inline-assembly statement, we can tell gcc |
| that the assembly fragment kills <code>%eax</code>, <code>%ebx</code>, |
| <code>%ecx</code> and the condition codes, so this fragment is |
| harmless when not running on Valgrind, runs quickly, |
| and does not require any other library support. |
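| <p> |
| Concretely, the macro packaging might look something like the |
| following. This is only a sketch of the gcc-inline-assembly |
| wrapping being described; the request codes and the exact operand |
| plumbing are placeholders, not a finalised interface. |
| <pre> |
| #define VALGRIND_MAGIC(_code, _addr, _len)                  \ |
|    __asm__ __volatile__ (                                   \ |
|       "movl %0, %%eax\n\t"                                  \ |
|       "movl %1, %%ebx\n\t"                                  \ |
|       "movl %2, %%ecx\n\t"                                  \ |
|       "rorl $13, %%ecx\n\t"                                 \ |
|       "rorl $19, %%ecx\n\t"                                 \ |
|       "rorl $11, %%eax\n\t"                                 \ |
|       "rorl $21, %%eax"                                     \ |
|       : /* no outputs */                                    \ |
|       : "r" (_addr), "r" (_len), "r" (_code)                \ |
|       : "eax", "ebx", "ecx", "cc" ) |
| |
| #define VALGRIND_MAKE_NOACCESS(addr, len) \ |
|            VALGRIND_MAGIC(1, (addr), (len)) |
| #define VALGRIND_MAKE_WRITABLE(addr, len) \ |
|            VALGRIND_MAGIC(2, (addr), (len)) |
| #define VALGRIND_MAKE_READABLE(addr, len) \ |
|            VALGRIND_MAGIC(3, (addr), (len)) |
| |
| int main ( void ) |
| { |
|    char buf[64]; |
|    VALGRIND_MAKE_NOACCESS(buf, sizeof buf); /* no-op off Valgrind */ |
|    return 0; |
| } |
| </pre> |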
| |
| |
| <p> |
| <b>Part 2: using it to detect interference between stack variables</b> |
| <p> |
| |
| Currently Valgrind cannot detect errors of the following form: |
| <pre> |
| void fooble ( void ) |
| { |
| int a[10]; |
| int b[10]; |
| a[10] = 99; |
| } |
| </pre> |
| Now imagine rewriting this as |
| <pre> |
| void fooble ( void ) |
| { |
| int spacer0; |
| int a[10]; |
| int spacer1; |
| int b[10]; |
| int spacer2; |
| VALGRIND_MAKE_NOACCESS(&spacer0, sizeof(int)); |
| VALGRIND_MAKE_NOACCESS(&spacer1, sizeof(int)); |
| VALGRIND_MAKE_NOACCESS(&spacer2, sizeof(int)); |
| a[10] = 99; |
| } |
| </pre> |
| Now the invalid write is certain to hit <code>spacer0</code> or |
| <code>spacer1</code>, so Valgrind will spot the error. |
| |
| <p> |
| There are two complications. |
| |
| <p> |
| The first is that we don't want to annotate sources by hand, so the |
| Right Thing to do is to write a C/C++ parser, annotator, prettyprinter |
| which does this automatically, and run it on post-CPP'd C/C++ source. |
| See http://www.cacheprof.org for an example of a system which |
| transparently inserts another phase into the gcc/g++ compilation |
| route. The parser/prettyprinter is probably not as hard as it sounds; |
| I would write it in Haskell, a powerful functional language well |
| suited to doing symbolic computation, with which I am intimately |
| familiar. There is already a C parser written in Haskell by someone in |
| the Haskell community, and that would probably be a good starting |
| point. |
| |
| <p> |
| The second complication is how to get rid of these |
| <code>NOACCESS</code> records inside Valgrind when the instrumented |
| function exits; after all, these refer to stack addresses and will |
| make no sense whatever when some other function happens to re-use the |
| same stack address range, probably shortly afterwards. I think I |
| would be inclined to define a special stack-specific macro |
| <pre> |
| VALGRIND_MAKE_NOACCESS_STACK(addr, len) |
| </pre> |
| which causes Valgrind to record the client's <code>%ESP</code> at the |
| time it is executed. Valgrind will then watch for changes in |
| <code>%ESP</code> and discard such records as soon as the protected |
| area is uncovered by an increase in <code>%ESP</code>. I hesitate |
| with this scheme only because it is potentially expensive, if there |
| are hundreds of such records, and considering that changes in |
| <code>%ESP</code> already require expensive messing with stack access |
| permissions. |
| |
| <p> |
| This is probably easier and more robust than having the instrumenter |
| program try to spot all exit points for the procedure and place |
| suitable deallocation annotations there. Plus C++ procedures can |
| bomb out at any point if they get an exception, so spotting return |
| points at the source level just won't work at all. |
| |
| <p> |
| Although some work, it's all eminently doable, and it would make |
| Valgrind into an even-more-useful tool. |
| |
| |
| <p> |
| </body> |
| </html> |