njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 1 | <?xml version="1.0"?> <!-- -*- sgml -*- --> |
| 2 | <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN" |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 3 | "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"> |
| 4 | |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 5 | |
| 6 | <chapter id="mc-tech-docs" |
| 7 | xreflabel="The design and implementation of Valgrind"> |
| 8 | |
| 9 | <title>The Design and Implementation of Valgrind</title> |
| 10 | <subtitle>Detailed technical notes for hackers, maintainers and |
| 11 | the overly-curious</subtitle> |
| 12 | |
| 13 | <sect1 id="mc-tech-docs.intro" xreflabel="Introduction"> |
| 14 | <title>Introduction</title> |
| 15 | |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 16 | <para>This document contains a detailed, highly-technical description of |
| 17 | the internals of Valgrind. This is not the user manual; if you are an |
| 18 | end-user of Valgrind, you do not want to read this. Conversely, if you |
| 19 | really are a hacker-type and want to know how it works, I assume that |
| 20 | you have read the user manual thoroughly.</para> |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 21 | |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 22 | <para>You may need to read this document several times, and carefully. |
| 23 | Some important things, I only say once.</para> |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 24 | |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 25 | <para>[Note: this document is now very old, and a lot of its contents |
| 26 | are out of date, and misleading.]</para> |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 27 | |
| 28 | |
| 29 | <sect2 id="mc-tech-docs.history" xreflabel="History"> |
| 30 | <title>History</title> |
| 31 | |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 32 | <para>Valgrind came into public view in late Feb 2002. However, it has |
| 33 | been under contemplation for a very long time, perhaps seriously for |
| 34 | about five years. Somewhat over two years ago, I started working on the |
| 35 | x86 code generator for the Glasgow Haskell Compiler |
| 36 | (http://www.haskell.org/ghc), gaining familiarity with x86 internals on |
| 37 | the way. I then did Cacheprof, gaining further x86 experience. Some |
| 38 | time around Feb 2000 I started experimenting with a user-space x86 |
| 39 | interpreter for x86-Linux. This worked, but it was clear that a |
| 40 | JIT-based scheme would be necessary to give reasonable performance for |
| 41 | Valgrind. Design work for the JITter started in earnest in Oct 2000, |
| 42 | and by early 2001 I had an x86-to-x86 dynamic translator which could run |
| 43 | quite large programs. This translator was in a sense pointless, since |
| 44 | it did not do any instrumentation or checking.</para> |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 45 | |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 46 | <para>Most of the rest of 2001 was taken up designing and implementing |
| 47 | the instrumentation scheme. The main difficulty, which consumed a lot |
| 48 | of effort, was to design a scheme which did not generate large numbers |
| 49 | of false uninitialised-value warnings. By late 2001 a satisfactory |
| 50 | scheme had been arrived at, and I started to test it on ever-larger |
| 51 | programs, with an eventual eye to making it work well enough so that it |
| 52 | was helpful to folks debugging the upcoming version 3 of KDE. I've used |
| 53 | KDE since before version 1.0, and wanted Valgrind to be an indirect |
| 54 | contribution to the KDE 3 development effort. At the start of Feb 02 |
| 55 | the kde-core-devel crew started using it, and gave a huge amount of |
| 56 | helpful feedback and patches in the space of three weeks. Snapshot |
| 57 | 20020306 is the result.</para> |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 58 | |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 59 | <para>In the best Unix tradition, or perhaps in the spirit of Fred |
| 60 | Brooks' depressing-but-completely-accurate epitaph "build one to throw |
| 61 | away; you will anyway", much of Valgrind is a second or third rendition |
| 62 | of the initial idea. The instrumentation machinery |
| 63 | (<filename>vg_translate.c</filename>, <filename>vg_memory.c</filename>) |
| 64 | and core CPU simulation (<filename>vg_to_ucode.c</filename>, |
| 65 | <filename>vg_from_ucode.c</filename>) have had three redesigns and |
| 66 | rewrites; the register allocator, low-level memory manager |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 67 | (<filename>vg_malloc2.c</filename>) and symbol table reader |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 68 | (<filename>vg_symtab2.c</filename>) are on the second rewrite. In a |
| 69 | sense, this document serves to record some of the knowledge gained as a |
| 70 | result.</para> |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 71 | |
| 72 | </sect2> |
| 73 | |
| 74 | |
| 75 | <sect2 id="mc-tech-docs.overview" xreflabel="Design overview"> |
| 76 | <title>Design overview</title> |
| 77 | |
| 78 | <para>Valgrind is compiled into a Linux shared object, |
| 79 | <filename>valgrind.so</filename>, and also a dummy one, |
| 80 | <filename>valgrinq.so</filename>, of which more later. The |
| 81 | <filename>valgrind</filename> shell script adds |
| 82 | <filename>valgrind.so</filename> to the |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 83 | <computeroutput>LD_PRELOAD</computeroutput> list of extra libraries to |
| 84 | be loaded with any dynamically linked executable. This is a standard |
| 85 | trick, one which I assume the |
| 86 | <computeroutput>LD_PRELOAD</computeroutput> mechanism was developed to |
| 87 | support.</para> |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 88 | |
| 89 | <para><filename>valgrind.so</filename> is linked with the |
| 90 | <computeroutput>-z initfirst</computeroutput> flag, which |
| 91 | requests that its initialisation code is run before that of any |
| 92 | other object in the executable image. When this happens, |
| 93 | valgrind gains control. The real CPU becomes "trapped" in |
| 94 | <filename>valgrind.so</filename> and the translations it |
| 95 | generates. The synthetic CPU provided by Valgrind does, however, |
| 96 | return from this initialisation function. So the normal startup |
| 97 | actions, orchestrated by the dynamic linker |
| 98 | <filename>ld.so</filename>, continue as usual, except on the |
| 99 | synthetic CPU, not the real one. Eventually |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 100 | <function>main</function> is run and returns, and |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 101 | then the finalisation code of the shared objects is run, |
| 102 | presumably in the reverse of the order in which they were initialised. |
| 103 | Remember, this is still all happening on the simulated CPU. |
| 104 | Eventually <filename>valgrind.so</filename>'s own finalisation |
| 105 | code is called. It spots this event, shuts down the simulated |
| 106 | CPU, prints any error summaries and/or does leak detection, and |
| 107 | returns from the initialisation code on the real CPU. At this |
| 108 | point, in effect the real and synthetic CPUs have merged back |
| 109 | into one, Valgrind has lost control of the program, and the |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 110 | program finally <function>exit()s</function> back to |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 111 | the kernel in the usual way.</para> |
| 112 | |
| 113 | <para>The normal course of activity, once Valgrind has started |
| 114 | up, is as follows. Valgrind never runs any part of your program |
| 115 | (usually referred to as the "client"), not a single byte of it, |
| 116 | directly. Instead it uses function |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 117 | <function>VG_(translate)</function> to translate |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 118 | basic blocks (BBs, straight-line sequences of code) into |
| 119 | instrumented translations, and those are run instead. The |
| 120 | translations are stored in the translation cache (TC), |
| 121 | <computeroutput>vg_tc</computeroutput>, with the translation |
| 122 | table (TT), <computeroutput>vg_tt</computeroutput> supplying the |
| 123 | original-to-translation code address mapping. Auxiliary array |
| 124 | <computeroutput>VG_(tt_fast)</computeroutput> is used as a |
| 125 | direct-map cache for fast lookups in TT; it usually achieves a |
| 126 | hit rate of around 98% and facilitates an orig-to-trans lookup in |
| 127 | 4 x86 insns, which is not bad.</para> |
| 128 | |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 129 | <para>Function <function>VG_(dispatch)</function> in |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 130 | <filename>vg_dispatch.S</filename> is the heart of the JIT |
| 131 | dispatcher. Once a translated code address has been found, it is |
| 132 | executed simply by an x86 <computeroutput>call</computeroutput> |
| 133 | to the translation. At the end of the translation, the next |
| 134 | original code addr is loaded into |
| 135 | <computeroutput>%eax</computeroutput>, and the translation then |
| 136 | does a <computeroutput>ret</computeroutput>, taking it back to |
| 137 | the dispatch loop, with, interestingly, zero branch |
| 138 | mispredictions. The address requested in |
| 139 | <computeroutput>%eax</computeroutput> is looked up first in |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 140 | <function>VG_(tt_fast)</function>, and, if not found, |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 141 | by calling C helper |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 142 | <function>VG_(search_transtab)</function>. If there |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 143 | is still no translation available, |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 144 | <function>VG_(dispatch)</function> exits back to the |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 145 | top-level C dispatcher |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 146 | <function>VG_(toploop)</function>, which arranges for |
| 147 | <function>VG_(translate)</function> to make a new |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 148 | translation. All fairly unsurprising, really. There are various |
| 149 | complexities described below.</para> |
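<para>The lookup path described above can be modelled in a few lines. This
is only a toy sketch in Python, not Valgrind's code: the class, the slot
count and the fake client are all invented for illustration, and a
"translation" is simply a callable that returns the next original
address.</para>

<programlisting><![CDATA[
```python
# Toy model of the lookup path: tt_fast (direct-mapped cache) -> TT -> translate.
# All names and sizes here are illustrative, not Valgrind's real ones.

FAST_SIZE = 1 << 10               # assumed number of direct-map slots


class Dispatcher:
    def __init__(self, translate):
        self.translate = translate          # stand-in for VG_(translate)
        self.tt = {}                        # stand-in for TT: orig addr -> translation
        self.tt_fast = [None] * FAST_SIZE   # slots hold (orig addr, translation)
        self.fast_hits = 0
        self.tt_hits = 0
        self.translations_made = 0

    def lookup(self, orig_addr):
        slot = orig_addr % FAST_SIZE
        entry = self.tt_fast[slot]
        if entry is not None and entry[0] == orig_addr:
            self.fast_hits += 1             # the common, cheap case
            return entry[1]
        trans = self.tt.get(orig_addr)      # slower search of the full table
        if trans is None:
            trans = self.translate(orig_addr)   # no translation yet: make one
            self.tt[orig_addr] = trans
            self.translations_made += 1
        else:
            self.tt_hits += 1
        self.tt_fast[slot] = (orig_addr, trans)  # refill the direct-map slot
        return trans

    def run(self, start_addr, max_blocks):
        """Run translations until one returns None as the next address."""
        addr, executed = start_addr, 0
        while addr is not None and executed < max_blocks:
            addr = self.lookup(addr)()      # a translation returns the next orig addr
            executed += 1
        return executed


def make_loop_translator(n_blocks, laps):
    """Build a fake client: n_blocks blocks executed in a ring, `laps` times."""
    remaining = {"steps": n_blocks * laps}

    def translate(orig_addr):
        def translation():
            remaining["steps"] -= 1
            if remaining["steps"] <= 0:
                return None
            return (orig_addr + 1) % n_blocks
        return translation

    return translate


d = Dispatcher(make_loop_translator(n_blocks=8, laps=100))
executed = d.run(start_addr=0, max_blocks=10_000)
print(executed, d.translations_made, d.fast_hits, d.tt_hits)
```
]]></programlisting>

<para>In this fake client every block is translated once and all later
visits hit the direct-map slot, which is the behaviour the fast cache is
meant to make common. A real client would also see slot conflicts, which
fall through to the full table search.</para>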
| 150 | |
| 151 | <para>The translator, orchestrated by |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 152 | <function>VG_(translate)</function>, is complicated |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 153 | but entirely self-contained. It is described in great detail in |
| 154 | subsequent sections. Translations are stored in TC, with TT |
| 155 | tracking administrative information. The translations are |
| 156 | subject to an approximate LRU-based management scheme. With the |
| 157 | current settings, the TC can hold at most about 15MB of |
| 158 | translations, and LRU passes prune it to about 13.5MB. Given |
| 159 | that the orig-to-translation expansion ratio is about 13:1 to |
| 160 | 14:1, this means TC holds translations for more or less a |
| 161 | megabyte of original code, which generally comes to about 70000 |
| 162 | basic blocks for C++ compiled with optimisation on. Generating |
| 163 | new translations is expensive, so it is worth having a large TC |
| 164 | to minimise the (capacity) miss rate.</para> |
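<para>As a rough check on the figures in the paragraph above, the
arithmetic can be worked through as follows. This is only a sketch: it
assumes "MB" means 2**20 bytes and takes the quoted numbers at face
value; the variable names are invented.</para>

<programlisting><![CDATA[
```python
MB = 1024 * 1024                 # assumption: "MB" means 2**20 bytes

tc_capacity = 15 * MB            # TC holds at most about 15MB of translations
tc_after_lru = 13.5 * MB         # an LRU pass prunes it to about 13.5MB
expansion_low, expansion_high = 13, 14   # translation bytes per original byte
basic_blocks = 70_000            # rough block count quoted for optimised C++

# Original code covered by a full TC, at each end of the expansion range.
orig_min = tc_capacity / expansion_high
orig_max = tc_capacity / expansion_low

# Implied average size of one original basic block and of its translation.
avg_orig_bb = orig_min / basic_blocks
avg_trans_bb = tc_capacity / basic_blocks

# Fraction of the cache discarded by one LRU pass.
pruned_fraction = (tc_capacity - tc_after_lru) / tc_capacity

print(f"original code covered: {orig_min / MB:.2f} to {orig_max / MB:.2f} MB")
print(f"average original block: {avg_orig_bb:.1f} bytes")
print(f"average translation:    {avg_trans_bb:.1f} bytes")
print(f"pruned per LRU pass:    {pruned_fraction:.0%}")
```
]]></programlisting>

<para>Under these assumptions a full TC covers a little over a megabyte
of original code, an average basic block comes out at roughly 16 bytes,
and each LRU pass discards about a tenth of the cache.</para>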
| 165 | |
| 166 | <para>The dispatcher, |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 167 | <function>VG_(dispatch)</function>, receives hints |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 168 | from the translations which allow it to cheaply spot all control |
| 169 | transfers corresponding to x86 |
| 170 | <computeroutput>call</computeroutput> and |
| 171 | <computeroutput>ret</computeroutput> instructions. It has to do |
| 172 | this in order to spot some special events:</para> |
| 173 | |
| 174 | <itemizedlist> |
| 175 | <listitem> |
| 176 | <para>Calls to |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 177 | <function>VG_(shutdown)</function>. This is |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 178 | Valgrind's cue to exit. NOTE: actually this is done a |
| 179 | different way; it should be cleaned up.</para> |
| 180 | </listitem> |
| 181 | |
| 182 | <listitem> |
| 183 | <para>Returns of signal handlers, to the return address |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 184 | <function>VG_(signalreturn_bogusRA)</function>. |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 185 | The signal simulator needs to know when a signal handler is |
| 186 | returning, so we spot jumps (returns) to this address.</para> |
| 187 | </listitem> |
| 188 | |
| 189 | <listitem> |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 190 | <para>Calls to <function>vg_trap_here</function>. |
| 191 | All <function>malloc</function>, |
| 192 | <function>free</function>, etc calls that the |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 193 | client program makes are eventually routed to a call to |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 194 | <function>vg_trap_here</function>, and Valgrind |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 195 | does its own special thing with these calls. In effect this |
| 196 | provides a trapdoor, by which Valgrind can intercept certain |
| 197 | calls on the simulated CPU, run the call as it sees fit |
| 198 | itself (on the real CPU), and return the result to the |
| 199 | simulated CPU, quite transparently to the client |
| 200 | program.</para> |
| 201 | </listitem> |
| 202 | |
| 203 | </itemizedlist> |
| 204 | |
| 205 | <para>Valgrind intercepts the client's |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 206 | <function>malloc</function>, |
| 207 | <function>free</function>, etc, calls, so that it can |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 208 | store additional information. Each block |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 209 | <function>malloc</function>'d by the client gives |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 210 | rise to a shadow block in which Valgrind stores the call stack at |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 211 | the time of the <function>malloc</function> call. |
| 212 | When the client calls <function>free</function>, |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 213 | Valgrind tries to find the shadow block corresponding to the |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 214 | address passed to <function>free</function>, and |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 215 | emits an error message if none can be found. If it is found, the |
| 216 | block is placed on the freed blocks queue |
| 217 | <computeroutput>vg_freed_list</computeroutput>, it is marked as |
| 218 | inaccessible, and its shadow block now records the call stack at |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 219 | the time of the <function>free</function> call. |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 220 | Keeping <computeroutput>free</computeroutput>'d blocks in this |
| 221 | queue allows Valgrind to spot all (presumably invalid) accesses |
| 222 | to them. However, once the volume of blocks in the free queue |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 223 | exceeds <function>VG_(clo_freelist_vol)</function>, |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 224 | blocks are finally removed from the queue.</para> |
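<para>The bookkeeping just described might be sketched like this. It is a
toy model under stated assumptions, not the real implementation: the
class and method names are invented, call stacks are plain strings, and
the volume limit is measured in bytes of client block size.</para>

<programlisting><![CDATA[
```python
from collections import deque


class ClientHeap:
    """Toy bookkeeping for malloc/free interception (illustrative names only)."""

    def __init__(self, freelist_vol):
        self.freelist_vol = freelist_vol    # stand-in for VG_(clo_freelist_vol)
        self.live = {}                      # addr -> shadow block
        self.freed = deque()                # stand-in for vg_freed_list, oldest first
        self.freed_bytes = 0
        self.errors = []

    def malloc(self, addr, size, stack):
        self.live[addr] = {"addr": addr, "size": size, "stack": stack}

    def free(self, addr, stack):
        shadow = self.live.pop(addr, None)
        if shadow is None:
            self.errors.append(("invalid free", addr, stack))
            return
        shadow["stack"] = stack             # now records where it was freed
        self.freed.append(shadow)           # block stays inaccessible while queued
        self.freed_bytes += shadow["size"]
        while self.freed_bytes > self.freelist_vol:
            oldest = self.freed.popleft()   # finally release the oldest block
            self.freed_bytes -= oldest["size"]

    def is_freed(self, addr):
        """True if addr falls inside a block still held in the freed queue."""
        return any(b["addr"] <= addr < b["addr"] + b["size"] for b in self.freed)


heap = ClientHeap(freelist_vol=100)
heap.malloc(0x1000, 60, stack="main>f")
heap.malloc(0x2000, 60, stack="main>g")
heap.free(0x1000, stack="main>h")
use_after_free_seen = heap.is_freed(0x1010)   # still queued, so detectable
heap.free(0x2000, stack="main>h")             # 120 bytes queued > 100: evict oldest
heap.free(0x3000, stack="main>h")             # never allocated
```
]]></programlisting>

<para>The sketch shows the trade-off behind the volume limit: while a
block sits in the queue, accesses to it can be reported, but once it is
evicted that ability is lost for that block.</para>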
| 225 | |
| 226 | <para>Keeping track of <literal>A</literal> and |
| 227 | <literal>V</literal> bits (note: if you don't know what these |
| 228 | are, you haven't read the user guide carefully enough) for memory |
| 229 | is done in <filename>vg_memory.c</filename>. This implements a |
| 230 | sparse array structure which covers the entire 4G address space |
| 231 | in a way which is reasonably fast and reasonably space efficient. |
| 232 | The 4G address space is divided up into 64K sections, each |
| 233 | covering 64KB of address space. Given a 32-bit address, the top |
| 234 | 16 bits are used to select one of the 65536 entries in |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 235 | <function>VG_(primary_map)</function>. The resulting |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 236 | "secondary" (<computeroutput>SecMap</computeroutput>) holds A and |
| 237 | V bits for the 64KB chunk of address space corresponding to the |
| 238 | lower 16 bits of the address.</para> |
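<para>A minimal sketch of such a two-level map follows. It is a
simplification under stated assumptions: it tracks only one A flag per
byte and leaves V bits out, it creates secondaries on demand, and it
treats a missing secondary as "inaccessible". The class and method names
are invented.</para>

<programlisting><![CDATA[
```python
SECMAP_SIZE = 1 << 16        # 64KB of address space per secondary map


class ShadowMemory:
    """Toy sparse map: one A (addressable) flag per byte; V bits are omitted."""

    def __init__(self):
        # Stand-in for VG_(primary_map): 65536 slots, secondaries made on demand.
        self.primary = [None] * (1 << 16)

    def _secmap(self, addr, create):
        hi = (addr >> 16) & 0xFFFF          # top 16 bits pick the primary entry
        if self.primary[hi] is None and create:
            self.primary[hi] = bytearray(SECMAP_SIZE)   # all bytes inaccessible
        return self.primary[hi]

    def set_addressable(self, addr, length, flag):
        for a in range(addr, addr + length):
            self._secmap(a, create=True)[a & 0xFFFF] = 1 if flag else 0

    def is_addressable(self, addr):
        sm = self._secmap(addr, create=False)
        return sm is not None and sm[addr & 0xFFFF] == 1

    def secondaries_in_use(self):
        return sum(1 for sm in self.primary if sm is not None)


shadow = ShadowMemory()
shadow.set_addressable(0x0804FFFE, 4, True)   # a 4-byte range straddling two chunks
print(shadow.secondaries_in_use())
```
]]></programlisting>

<para>Only the chunks that are actually touched cost any memory, which is
what makes covering the whole 4G address space affordable.</para>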
| 239 | |
| 240 | </sect2> |
| 241 | |
| 242 | |
| 243 | |
| 244 | <sect2 id="mc-tech-docs.design" xreflabel="Design decisions"> |
| 245 | <title>Design decisions</title> |
| 246 | |
| 247 | <para>Some design decisions were motivated by the need to make |
| 248 | Valgrind debuggable. Imagine you are writing a CPU simulator. |
| 249 | It works fairly well. However, you run some large program, like |
| 250 | Netscape, and after tens of millions of instructions, it crashes. |
| 251 | How can you figure out where in your simulator the bug is?</para> |
| 252 | |
| 253 | <para>Valgrind's answer is: cheat. Valgrind is designed so that |
| 254 | it is possible to switch back to running the client program on |
| 255 | the real CPU at any point. Using the |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 256 | <option>--stop-after= </option> flag, you can ask |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 257 | Valgrind to run just some number of basic blocks, and then run |
| 258 | the rest of the way on the real CPU. If you are searching for a |
| 259 | bug in the simulated CPU, you can use this to do a binary search, |
| 260 | which quickly leads you to the specific basic block which is |
| 261 | causing the problem.</para> |
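<para>The binary search can be sketched as below. The callback
<computeroutput>runs_ok</computeroutput> is hypothetical: it stands for
"run the client with the first n basic blocks simulated, the rest on the
real CPU, and report whether the run behaved". The sketch also assumes
the failure is deterministic and monotonic in n.</para>

<programlisting><![CDATA[
```python
def first_bad_block(runs_ok, max_blocks):
    """Smallest n in 1..max_blocks for which runs_ok(n) is False.

    runs_ok(n) is a hypothetical callback that runs the client with the first
    n basic blocks simulated and the rest on the real CPU, and reports whether
    the run behaved correctly. Assumes runs_ok(0) would pass, runs_ok(max_blocks)
    fails, and failures are monotonic: once n includes the bad block, it fails.
    """
    lo, hi = 0, max_blocks          # invariant: lo passes, hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if runs_ok(mid):
            lo = mid
        else:
            hi = mid
    return hi


# Fake client whose 7,654,321st basic block is mistranslated.
BAD_BLOCK = 7_654_321
attempts = []


def fake_run(n):
    attempts.append(n)
    return n < BAD_BLOCK


culprit = first_bad_block(fake_run, max_blocks=50_000_000)
print(culprit, len(attempts))
```
]]></programlisting>

<para>Even with tens of millions of blocks executed before the crash,
a couple of dozen runs are enough to isolate the offending block, provided
the block count is reproducible from run to run.</para>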
| 262 | |
| 263 | <para>This is all very handy. It does constrain the design in |
| 264 | certain unimportant ways. Firstly, the layout of memory, when |
| 265 | viewed from the client's point of view, must be identical |
| 266 | regardless of whether it is running on the real or simulated CPU. |
| 267 | This means that Valgrind can't do pointer swizzling -- well, no |
| 268 | great loss -- and it can't run on the same stack as the client -- |
| 269 | again, no great loss. Valgrind operates on its own stack, |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 270 | <function>VG_(stack)</function>, which it switches to |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 271 | at startup, temporarily switching back to the client's stack when |
| 272 | doing system calls for the client.</para> |
| 273 | |
| 274 | <para>Valgrind also receives signals on its own stack, |
| 275 | <computeroutput>VG_(sigstack)</computeroutput>, but for different |
| 276 | gruesome reasons discussed below.</para> |
| 277 | |
| 278 | <para>This nice clean |
| 279 | switch-back-to-the-real-CPU-whenever-you-like story is muddied by |
| 280 | signals. Problem is that signals arrive at arbitrary times and |
| 281 | tend to slightly perturb the basic block count, with the result |
| 282 | that you can get close to the basic block causing a problem but |
| 283 | can't home in on it exactly. My kludgey hack is to define |
| 284 | <computeroutput>SIGNAL_SIMULATION</computeroutput> to 1 towards |
| 285 | the bottom of <filename>vg_syscall_mem.c</filename>, so that |
| 286 | signal handlers are run on the real CPU and don't change the BB |
| 287 | counts.</para> |
| 288 | |
| 289 | <para>A second hole in the switch-back-to-real-CPU story is that |
| 290 | Valgrind's way of delivering signals to the client is different |
| 291 | from that of the kernel. Specifically, the layout of the signal |
| 292 | delivery frame, and the mechanism used to detect a sighandler |
| 293 | returning, are different. So you can't expect to make the |
| 294 | transition inside a sighandler and still have things working, but |
| 295 | in practice that's not much of a restriction.</para> |
| 296 | |
| 297 | <para>Valgrind's implementation of |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 298 | <function>malloc</function>, |
| 299 | <function>free</function>, etc, (in |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 300 | <filename>vg_clientmalloc.c</filename>, not the low-level stuff |
| 301 | in <filename>vg_malloc2.c</filename>) is somewhat complicated by |
| 302 | the need to handle switching back at arbitrary points. It does |
| 303 | work, though.</para> |
| 304 | |
| 305 | </sect2> |
| 306 | |
| 307 | |
| 308 | |
| 309 | <sect2 id="mc-tech-docs.correctness" xreflabel="Correctness"> |
| 310 | <title>Correctness</title> |
| 311 | |
| 312 | <para>There's only one of me, and I have a Real Life (tm) as well |
| 313 | as hacking Valgrind [allegedly :-]. That means I don't have time |
| 314 | to waste chasing endless bugs in Valgrind. My emphasis is |
| 315 | therefore on doing everything as simply as possible, with |
| 316 | correctness, stability and robustness being the number one |
| 317 | priority, more important than performance or functionality. As a |
| 318 | result:</para> |
| 319 | |
| 320 | <itemizedlist> |
| 321 | |
| 322 | <listitem> |
| 323 | <para>The code is absolutely loaded with assertions, and |
| 324 | these are <command>permanently enabled.</command> I have no |
| 325 | plan to remove or disable them later. Over the past couple |
| 326 | of months, as valgrind has become more widely used, they have |
| 327 | shown their worth, pulling up various bugs which would |
| 328 | otherwise have appeared as hard-to-find segmentation |
| 329 | faults.</para> |
| 330 | |
| 331 | <para>I am of the view that it's acceptable to spend 5% of |
| 332 | the total running time of your valgrindified program doing |
| 333 | assertion checks and other internal sanity checks.</para> |
| 334 | </listitem> |
| 335 | |
| 336 | <listitem> |
| 337 | <para>Aside from the assertions, valgrind contains various |
| 338 | sets of internal sanity checks, which get run at varying |
| 339 | frequencies during normal operation. |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 340 | <function>VG_(do_sanity_checks)</function> runs |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 341 | every 1000 basic blocks, which means 500 to 2000 times/second |
| 342 | for typical machines at present. It checks that Valgrind |
| 343 | hasn't overrun its private stack, and does some simple checks |
| 344 | on the memory permissions maps. Once every 25 calls it does |
| 345 | some more extensive checks on those maps. Etc, etc.</para> |
| 346 | <para>The following components also have sanity check code, |
| 347 | which can be enabled to aid debugging:</para> |
| 348 | <itemizedlist> |
| 349 | <listitem><para>The low-level memory-manager |
| 350 | (<computeroutput>VG_(mallocSanityCheckArena)</computeroutput>). |
| 351 | This does a complete check of all blocks and chains in an |
| 352 | arena, which is very slow. Is not engaged by default.</para> |
| 353 | </listitem> |
| 354 | |
| 355 | <listitem> |
| 356 | <para>The symbol table reader(s): various checks to |
| 357 | ensure uniqueness of mappings; see |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 358 | <function>VG_(read_symbols)</function> for a |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 359 | start. Is permanently engaged.</para> |
| 360 | </listitem> |
| 361 | |
| 362 | <listitem> |
| 363 | <para>The A and V bit tracking stuff in |
| 364 | <filename>vg_memory.c</filename>. This can be compiled |
| 365 | with cpp symbol |
| 366 | <computeroutput>VG_DEBUG_MEMORY</computeroutput> defined, |
| 367 | which removes all the fast, optimised cases, and uses |
| 368 | simple-but-slow fallbacks instead. Not engaged by |
| 369 | default.</para> |
| 370 | </listitem> |
| 371 | |
| 372 | <listitem> |
| 373 | <para>Ditto |
| 374 | <computeroutput>VG_DEBUG_LEAKCHECK</computeroutput>.</para> |
| 375 | </listitem> |
| 376 | |
| 377 | <listitem> |
| 378 | <para>The JITter parses x86 basic blocks into sequences |
| 379 | of UCode instructions. It then sanity checks each one |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 380 | with <function>VG_(saneUInstr)</function> and |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 381 | sanity checks the sequence as a whole with |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 382 | <function>VG_(saneUCodeBlock)</function>. |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 383 | This stuff is engaged by default, and has caught some |
| 384 | way-obscure bugs in the simulated CPU machinery in its |
| 385 | time.</para> |
| 386 | </listitem> |
| 387 | |
| 388 | <listitem> |
| 389 | <para>The system call wrapper does |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 390 | <function>VG_(first_and_last_secondaries_look_plausible)</function> |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 391 | after every syscall; this is known to pick up bugs in the |
| 392 | syscall wrappers. Engaged by default.</para> |
| 393 | </listitem> |
| 394 | |
| 395 | <listitem> |
| 396 | <para>The main dispatch loop, in |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 397 | <function>VG_(dispatch)</function>, checks |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 398 | that translations do not set |
| 399 | <computeroutput>%ebp</computeroutput> to any value |
| 400 | different from |
| 401 | <computeroutput>VG_EBP_DISPATCH_CHECKED</computeroutput> |
| 402 | or <computeroutput>&amp; VG_(baseBlock)</computeroutput>. |
| 403 | In effect this test is free, and is permanently |
| 404 | engaged.</para> |
| 405 | </listitem> |
| 406 | |
| 407 | <listitem> |
| 408 | <para>There are a couple of ifdefed-out consistency |
| 409 | checks I inserted whilst debugging the new register |
| 410 | allocator, |
| 411 | <computeroutput>vg_do_register_allocation</computeroutput>.</para> |
| 412 | </listitem> |
| 413 | </itemizedlist> |
| 414 | </listitem> |
| 415 | |
| 416 | <listitem> |
| 417 | <para>I try to avoid techniques, algorithms, mechanisms, etc, |
| 418 | for which I can supply neither a convincing argument that |
| 419 | they are correct, nor sanity-check code which might pick up |
| 420 | bugs in my implementation. I don't always succeed in this, |
| 421 | but I try. Basically the idea is: avoid techniques which |
| 422 | are, in practice, unverifiable, in some sense. When doing |
| 423 | anything, always have in mind: "how can I verify that this is |
| 424 | correct?"</para> |
| 425 | </listitem> |
| 426 | |
| 427 | </itemizedlist> |
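<para>The tiered checking schedule mentioned in the list above (a cheap
check every 1000 basic blocks, a more extensive one on every 25th call)
might look like this in outline. The class and the counting callbacks are
invented for illustration; only the two intervals come from the
text.</para>

<programlisting><![CDATA[
```python
class SanityScheduler:
    """Run a cheap check every bb_interval blocks, an expensive one less often."""

    def __init__(self, cheap, expensive, bb_interval=1000, expensive_every=25):
        self.cheap = cheap
        self.expensive = expensive
        self.bb_interval = bb_interval
        self.expensive_every = expensive_every
        self.blocks = 0
        self.calls = 0

    def on_basic_block(self):
        self.blocks += 1
        if self.blocks % self.bb_interval != 0:
            return
        self.calls += 1
        self.cheap()
        if self.calls % self.expensive_every == 0:
            self.expensive()


counts = {"cheap": 0, "expensive": 0}


def cheap_check():
    counts["cheap"] += 1


def expensive_check():
    counts["expensive"] += 1


sched = SanityScheduler(cheap_check, expensive_check)
for _ in range(100_000):
    sched.on_basic_block()
print(counts)
```
]]></programlisting>

<para>Tiering the checks this way keeps the frequent path cheap while
still exercising the slow, thorough checks regularly.</para>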
| 428 | |
| 429 | |
| 430 | <para>Some more specific things are:</para> |
| 431 | <itemizedlist> |
| 432 | <listitem> |
| 433 | <para>Valgrind runs in the same namespace as the client, at |
| 434 | least from <filename>ld.so</filename>'s point of view, and it |
| 435 | therefore absolutely had better not export any symbol with a |
| 436 | name which could clash with that of the client or any of its |
| 437 | libraries. Therefore, all globally visible symbols exported |
| 438 | from <filename>valgrind.so</filename> are defined using the |
| 439 | <computeroutput>VG_</computeroutput> CPP macro. As you'll |
| 440 | see from <filename>vg_constants.h</filename>, this prepends |
| 441 | some arbitrary prefix to the symbol, in order that it be, we |
| 442 | hope, globally unique. Currently the prefix is |
| 443 | <computeroutput>vgPlain_</computeroutput>. For convenience |
| 444 | there are also <computeroutput>VGM_</computeroutput>, |
| 445 | <computeroutput>VGP_</computeroutput> and |
| 446 | <computeroutput>VGOFF_</computeroutput>. All locally defined |
| 447 | symbols are declared <computeroutput>static</computeroutput> |
| 448 | and do not appear in the final shared object.</para> |
| 449 | |
| 450 | <para>To check this, I periodically do <computeroutput>nm |
| 451 | valgrind.so | grep " T "</computeroutput>, which shows you |
| 452 | all the globally exported text symbols. They should all have |
| 453 | an approved prefix, except for those like |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 454 | <function>malloc</function>, |
| 455 | <function>free</function>, etc, which we |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 456 | deliberately want to shadow and take precedence over the same |
| 457 | names exported from <filename>glibc.so</filename>, so that |
| 458 | valgrind can intercept those calls easily. Similarly, |
| 459 | <computeroutput>nm valgrind.so | grep " D "</computeroutput> |
| 460 | allows you to find any rogue data-segment symbol |
| 461 | names.</para> |
| 462 | </listitem> |
| 463 | |
| 464 | <listitem> |
| 465 | <para>Valgrind tries, and almost succeeds, in being |
| 466 | completely independent of all other shared objects, in |
| 467 | particular of <filename>glibc.so</filename>. For example, we |
| 468 | have our own low-level memory manager in |
| 469 | <filename>vg_malloc2.c</filename>, which is a fairly standard |
| 470 | malloc/free scheme augmented with arenas, and |
| 471 | <filename>vg_mylibc.c</filename> exports reimplementations of |
| 472 | various bits and pieces you'd normally get from the C |
| 473 | library.</para> |
| 474 | |
| 475 | <para>Why all the hassle? Because imagine the potential |
| 476 | chaos of both the simulated and real CPUs executing in |
| 477 | <filename>glibc.so</filename>. It just seems simpler and |
| 478 | cleaner to be completely self-contained, so that only the |
| 479 | simulated CPU visits <filename>glibc.so</filename>. In |
| 480 | practice it's not much hassle anyway. Also, valgrind starts |
| 481 | up before glibc has a chance to initialise itself, and who |
| 482 | knows what difficulties that could lead to. Finally, glibc |
| 483 | has definitions for some types, specifically |
| 484 | <computeroutput>sigset_t</computeroutput>, which conflict with |
| 485 | (are different from) the Linux kernel's idea of the same. When |
| 486 | Valgrind wants to fiddle around with signal stuff, it wants |
| 487 | to use the kernel's definitions, not glibc's definitions. So |
| 488 | it's simplest just to keep glibc out of the picture |
| 489 | entirely.</para> |
| 490 | |
| 491 | <para>To find out which glibc symbols are used by Valgrind, |
| 492 | reinstate the link flags <computeroutput>-nostdlib |
| 493 | -Wl,-no-undefined</computeroutput>. This causes linking to |
| 494 | fail, but will tell you what you depend on. I have mostly, |
| 495 | but not entirely, got rid of the glibc dependencies; what |
| 496 | remains is, IMO, fairly harmless. AFAIK the current |
| 497 | dependencies are: <computeroutput>memset</computeroutput>, |
| 498 | <computeroutput>memcmp</computeroutput>, |
| 499 | <computeroutput>stat</computeroutput>, |
| 500 | <computeroutput>system</computeroutput>, |
| 501 | <computeroutput>sbrk</computeroutput>, |
| 502 | <computeroutput>setjmp</computeroutput> and |
| 503 | <computeroutput>longjmp</computeroutput>.</para> |
| 504 | </listitem> |
| 505 | |
| 506 | <listitem> |
| 507 | <para>Similarly, valgrind should not really import any |
| 508 | headers other than the Linux kernel headers, since it knows |
| 509 | of no API other than the kernel interface to talk to. At the |
| 510 | moment this is really not in a good state, and |
| 511 | <computeroutput>vg_syscall_mem</computeroutput> imports, via |
| 512 | <filename>vg_unsafe.h</filename>, a significant number of |
| 513 | C-library headers so as to know the sizes of various structs |
| 514 | passed across the kernel boundary. This is of course |
| 515 | completely bogus, since there is no guarantee that the C |
| 516 | library's definitions of these structs match those of the |
| 517 | kernel. I have started to sort this out using |
| 518 | <filename>vg_kerneliface.h</filename>, into which I had |
| 519 | intended to copy all kernel definitions which Valgrind could |
| 520 | need, but this has not gotten very far. At the moment it |
| 521 | mostly contains definitions for |
| 522 | <computeroutput>sigset_t</computeroutput> and |
| 523 | <computeroutput>struct sigaction</computeroutput>, since the |
| 524 | kernel's definition for these really does clash with glibc's. |
| 525 | I plan to use a <computeroutput>vki_</computeroutput> prefix |
| 526 | on all these types and constants, to denote the fact that |
| 527 | they pertain to <command>V</command>algrind's |
| 528 | <command>K</command>ernel |
| 529 | <command>I</command>nterface.</para> |
| 530 | |
| 531 | <para>Another advantage of having a |
| 532 | <filename>vg_kerneliface.h</filename> file is that it makes |
| 533 | it simpler to interface to a different kernel. One can, for |
| 534 | example, easily imagine writing a new |
| 535 | <filename>vg_kerneliface.h</filename> for FreeBSD, or x86 |
| 536 | NetBSD.</para> |
| 537 | </listitem> |
| 538 | |
| 539 | </itemizedlist> |
| 540 | |
| 541 | </sect2> |
| 542 | |
| 543 | |
| 544 | |
| 545 | <sect2 id="mc-tech-docs.limits" xreflabel="Current limitations"> |
| 546 | <title>Current limitations</title> |
| 547 | |
| 548 | <para>Support for weird (non-POSIX) signal stuff is patchy. Does |
| 549 | anybody care?</para> |
| 550 | |
| 551 | </sect2> |
| 552 | |
| 553 | </sect1> |
| 554 | |
| 555 | |
| 556 | |
| 557 | |
| 558 | |
| 559 | <sect1 id="mc-tech-docs.jitter" xreflabel="The instrumenting JITter"> |
| 560 | <title>The instrumenting JITter</title> |
| 561 | |
| 562 | <para>This really is the heart of the matter. We begin with |
| 563 | various side issues.</para> |
| 564 | |
| 565 | |
| 566 | <sect2 id="mc-tech-docs.storage" |
| 567 | xreflabel="Run-time storage, and the use of host registers"> |
| 568 | <title>Run-time storage, and the use of host registers</title> |
| 569 | |
| 570 | <para>Valgrind translates client (original) basic blocks into |
| 571 | instrumented basic blocks, which live in the translation cache |
| 572 | TC, until either the client finishes or the translations are |
| 573 | ejected from TC to make room for newer ones.</para> |
| 574 | |
| 575 | <para>Since it generates x86 code in memory, Valgrind has |
| 576 | complete control of the use of registers in the translations. |
| 577 | Now pay attention. I shall say this only once, and it is |
| 578 | important you understand this. In what follows I will refer to |
| 579 | registers in the host (real) CPU using their standard names, |
| 580 | <computeroutput>%eax</computeroutput>, |
| 581 | <computeroutput>%edi</computeroutput>, etc. I refer to registers |
| 582 | in the simulated CPU by capitalising them: |
| 583 | <computeroutput>%EAX</computeroutput>, |
| 584 | <computeroutput>%EDI</computeroutput>, etc. These two sets of |
| 585 | registers usually bear no direct relationship to each other; |
| 586 | there is no fixed mapping between them. This naming scheme is |
| 587 | used fairly consistently in the comments in the sources.</para> |
| 588 | |
| 589 | <para>Host registers, once things are up and running, are used as |
| 590 | follows:</para> |
| 591 | |
| 592 | <itemizedlist> |
| 593 | <listitem> |
| 594 | <para><computeroutput>%esp</computeroutput>, the real stack |
| 595 | pointer, points somewhere in Valgrind's private stack area, |
| 596 | <computeroutput>VG_(stack)</computeroutput> or, transiently, |
| 597 | into its signal delivery stack, |
| 598 | <computeroutput>VG_(sigstack)</computeroutput>.</para> |
| 599 | </listitem> |
| 600 | |
| 601 | <listitem> |
| 602 | <para><computeroutput>%edi</computeroutput> is used as a |
| 603 | temporary in code generation; it is almost always dead, |
| 604 | except when used for the |
| 605 | <computeroutput>Left</computeroutput> value-tag operations.</para> |
| 606 | </listitem> |
| 607 | |
| 608 | <listitem> |
| 609 | <para><computeroutput>%eax</computeroutput>, |
| 610 | <computeroutput>%ebx</computeroutput>, |
| 611 | <computeroutput>%ecx</computeroutput>, |
| 612 | <computeroutput>%edx</computeroutput> and |
| 613 | <computeroutput>%esi</computeroutput> are available to |
| 614 | Valgrind's register allocator. They are dead (carry |
| 615 | unimportant values) in between translations, and are live |
| 616 | only in translations. The one exception to this is |
| 617 | <computeroutput>%eax</computeroutput>, which, as mentioned |
| 618 | far above, has a special significance to the dispatch loop |
| 619 | <computeroutput>VG_(dispatch)</computeroutput>: when a |
| 620 | translation returns to the dispatch loop, |
| 621 | <computeroutput>%eax</computeroutput> is expected to contain |
| 622 | the original-code-address of the next translation to run. |
| 623 | The register allocator is so good at minimising spill code |
| 624 | that using five regs and not having to save/restore |
| 625 | <computeroutput>%edi</computeroutput> actually gives better |
| 626 | code than allocating to <computeroutput>%edi</computeroutput> |
| 627 | as well, but then having to push/pop it around special |
| 628 | uses.</para> |
| 629 | </listitem> |
| 630 | |
| 631 | <listitem> |
| 632 | <para><computeroutput>%ebp</computeroutput> points |
| 633 | permanently at |
| 634 | <computeroutput>VG_(baseBlock)</computeroutput>. Valgrind's |
| 635 | translations are position-independent, partly because this is |
| 636 | convenient, but also because translations get moved around in |
| 637 | TC as part of the LRUing activity. <command>All</command> |
| 638 | static entities which need to be referred to from generated |
| 639 | code, whether data or helper functions, are stored starting |
| 640 | at <computeroutput>VG_(baseBlock)</computeroutput> and are |
| 641 | therefore reached by indexing from |
| 642 | <computeroutput>%ebp</computeroutput>. There is but one |
| 643 | exception, which is that by placing the value |
| 644 | <computeroutput>VG_EBP_DISPATCH_CHECKED</computeroutput> in |
| 645 | <computeroutput>%ebp</computeroutput> just before a return to |
| 646 | the dispatcher, the dispatcher is informed that the next |
| 647 | address to run, in <computeroutput>%eax</computeroutput>, |
| 648 | requires special treatment.</para> |
| 649 | </listitem> |
| 650 | |
| 651 | <listitem> |
| 652 | <para>The real machine's FPU state is pretty much |
| 653 | unimportant, for reasons which will become obvious. Ditto |
| 654 | its <computeroutput>%eflags</computeroutput> register.</para> |
| 655 | </listitem> |
| 656 | |
| 657 | </itemizedlist> |
| 658 | |
| 659 | <para>The state of the simulated CPU is stored in memory, in |
| 660 | <computeroutput>VG_(baseBlock)</computeroutput>, which is a block |
| 661 | of 200 words IIRC. Recall that |
| 662 | <computeroutput>%ebp</computeroutput> points permanently at the |
| 663 | start of this block. Function |
| 664 | <computeroutput>vg_init_baseBlock</computeroutput> decides what |
| 665 | the offsets of various entities in |
| 666 | <computeroutput>VG_(baseBlock)</computeroutput> are to be, and |
| 667 | allocates word offsets for them. The code generator then emits |
| 668 | <computeroutput>%ebp</computeroutput> relative addresses to get |
| 669 | at those things. The sequence in which entities are allocated |
| 670 | has been carefully chosen so that the 32 most popular entities |
| 671 | come first, because this means 8-bit offsets can be used in the |
| 672 | generated code.</para> |
| 673 | |
| 674 | <para>If I was clever, I could make |
| 675 | <computeroutput>%ebp</computeroutput> point 32 words along |
| 676 | <computeroutput>VG_(baseBlock)</computeroutput>, so that I'd have |
| 677 | another 32 words of short-form offsets available, but that's just |
| 678 | complicated, and it's not important -- the first 32 words take |
| 679 | 99% (or whatever) of the traffic.</para> |
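<para>The arithmetic behind the "magic" 32-word boundary can be checked
with a minimal sketch. It assumes 4-byte words and the usual x86 rule
that a one-byte displacement is a signed value in the range -128 to 127.
The function names are hypothetical and do not come from the Valgrind
sources.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; only the arithmetic is taken from the text. */
enum { WORD_BYTES = 4 };   /* word size on 32-bit x86 */

/* Byte displacement from %ebp needed to reach word `word_index` of the
   block, when %ebp points `bias_words` words past the block's start. */
static int32_t ebp_displacement(int32_t word_index, int32_t bias_words)
{
    return (word_index - bias_words) * WORD_BYTES;
}

/* An x86 displacement fits the short (8-bit) form if it is a signed
   byte, that is, in the range [-128, 127]. */
static bool fits_disp8(int32_t displacement)
{
    return displacement >= -128 && displacement <= 127;
}

/* Count how many words of a `total_words`-word block are reachable
   with the short form for a given %ebp bias. */
static int count_short_form_words(int32_t total_words, int32_t bias_words)
{
    int count = 0;
    assert(total_words >= 0);
    for (int32_t w = 0; w < total_words; w++) {
        if (fits_disp8(ebp_displacement(w, bias_words)))
            count++;
    }
    return count;
}
```
]]></programlisting>
<para>With no bias, words 0 to 31 have displacements 0 to 124 and so get
the short form, while word 32 (displacement 128) does not. With a bias
of 32 words the reachable range becomes 64 words, which is the extra 32
words mentioned above.</para>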
| 680 | |
| 681 | <para>Currently, the sequence of stuff in |
| 682 | <computeroutput>VG_(baseBlock)</computeroutput> is as |
| 683 | follows:</para> |
| 684 | |
| 685 | <itemizedlist> |
| 686 | <listitem> |
| 687 | <para>9 words, holding the simulated integer registers, |
| 688 | <computeroutput>%EAX</computeroutput> |
| 689 | .. <computeroutput>%EDI</computeroutput>, and the simulated |
| 690 | flags, <computeroutput>%EFLAGS</computeroutput>.</para> |
| 691 | </listitem> |
| 692 | |
| 693 | <listitem> |
| 694 | <para>Another 9 words, holding the V bit "shadows" for the |
| 695 | above 9 regs.</para> |
| 696 | </listitem> |
| 697 | |
| 698 | <listitem> |
| 699 | <para>The <command>addresses</command> of various helper |
| 700 | routines called from generated code: |
| 701 | <computeroutput>VG_(helper_value_check4_fail)</computeroutput>, |
| 702 | <computeroutput>VG_(helper_value_check0_fail)</computeroutput>, |
| 703 | which register V-check failures, |
| 704 | <computeroutput>VG_(helperc_STOREV4)</computeroutput>, |
| 705 | <computeroutput>VG_(helperc_STOREV1)</computeroutput>, |
| 706 | <computeroutput>VG_(helperc_LOADV4)</computeroutput>, |
| 707 | <computeroutput>VG_(helperc_LOADV1)</computeroutput>, which |
| 708 | do stores and loads of V bits to/from the sparse array which |
| 709 | keeps track of V bits in memory, and |
| 710 | <computeroutput>VGM_(handle_esp_assignment)</computeroutput>, |
sewardj | 3387889 | 2007-11-17 09:43:25 +0000 | [diff] [blame^] | 711 | which messes with memory addressability resulting from |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 712 | changes in <computeroutput>%ESP</computeroutput>.</para> |
| 713 | </listitem> |
| 714 | |
| 715 | <listitem> |
| 716 | <para>The simulated <computeroutput>%EIP</computeroutput>.</para> |
| 717 | </listitem> |
| 718 | |
| 719 | <listitem> |
| 720 | <para>24 spill words, for when the register allocator can't |
| 721 | make it work with 5 measly registers.</para> |
| 722 | </listitem> |
| 723 | |
| 724 | <listitem> |
| 725 | <para>Addresses of helpers |
| 726 | <computeroutput>VG_(helperc_STOREV2)</computeroutput>, |
| 727 | <computeroutput>VG_(helperc_LOADV2)</computeroutput>. These |
| 728 | are here because 2-byte loads and stores are relatively rare, |
| 729 | so are placed above the magic 32-word offset boundary.</para> |
| 730 | </listitem> |
| 731 | |
| 732 | <listitem> |
| 733 | <para>For similar reasons, addresses of helper functions |
| 734 | <computeroutput>VGM_(fpu_write_check)</computeroutput> and |
| 735 | <computeroutput>VGM_(fpu_read_check)</computeroutput>, which |
| 736 | handle the A/V maps testing and changes required by FPU |
| 737 | writes/reads.</para> |
| 738 | </listitem> |
| 739 | |
| 740 | <listitem> |
| 741 | <para>Some other boring helper addresses: |
| 742 | <computeroutput>VG_(helper_value_check2_fail)</computeroutput> |
| 743 | and |
| 744 | <computeroutput>VG_(helper_value_check1_fail)</computeroutput>. |
| 745 | These are probably never emitted now, and should be |
| 746 | removed.</para> |
| 747 | </listitem> |
| 748 | |
| 749 | <listitem> |
| 750 | <para>The entire state of the simulated FPU, which I believe |
| 751 | to be 108 bytes long.</para> |
| 752 | </listitem> |
| 753 | |
| 754 | <listitem> |
| 755 | <para>Finally, the addresses of various other helper |
| 756 | functions in <filename>vg_helpers.S</filename>, which deal |
| 757 | with rare situations which are tedious or difficult to |
| 758 | generate code in-line for.</para> |
| 759 | </listitem> |
| 760 | |
| 761 | </itemizedlist> |
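<para>The allocation order in the list above can be sketched as a simple
bump allocator over word offsets. This is an illustration only: the
names <computeroutput>alloc_words</computeroutput>,
<computeroutput>Layout</computeroutput> and
<computeroutput>layout_baseblock</computeroutput> are hypothetical, the
block size of 200 words is the figure quoted from memory above, and the
helper count of seven is simply the number of names in the third list
item.</para>
<programlisting><![CDATA[
```c
#include <assert.h>

enum { BASEBLOCK_WORDS = 200 };   /* size quoted in the text ("IIRC") */

static int next_free_word = 0;    /* next unallocated word offset */

/* Reserve n consecutive words and return the offset of the first. */
static int alloc_words(int n)
{
    int offset = next_free_word;
    assert(n > 0 && offset + n <= BASEBLOCK_WORDS);
    next_free_word += n;
    return offset;
}

/* Word offsets of the first few groups, in the order the text lists. */
typedef struct {
    int int_regs;      /* 9 words: integer registers plus %EFLAGS */
    int shadow_regs;   /* 9 words: V-bit shadows of the above      */
    int hot_helpers;   /* 7 words: frequently used helper addresses */
    int eip;           /* 1 word: simulated %EIP                    */
    int spill;         /* 24 words: register allocator spill slots  */
    int cold_helpers;  /* rarely used helpers start here            */
} Layout;

static Layout layout_baseblock(void)
{
    Layout l;
    next_free_word = 0;
    l.int_regs     = alloc_words(9);
    l.shadow_regs  = alloc_words(9);
    l.hot_helpers  = alloc_words(7);
    l.eip          = alloc_words(1);
    l.spill        = alloc_words(24);
    l.cold_helpers = alloc_words(2);   /* the 2-byte STOREV/LOADV helpers */
    return l;
}
```
]]></programlisting>
<para>Under these assumed counts the spill area starts at word 26 and the
2-byte load/store helpers at word 50, so the rarely used helpers do land
above the 32-word boundary. The real offsets are whatever
<computeroutput>vg_init_baseBlock</computeroutput> decides.</para>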
| 762 | |
| 763 | <para>As a general rule, the simulated machine's state lives |
| 764 | permanently in memory at |
| 765 | <computeroutput>VG_(baseBlock)</computeroutput>. However, the |
| 766 | JITter does some optimisations which allow the simulated integer |
| 767 | registers to be cached in real registers over multiple simulated |
| 768 | instructions within the same basic block. These are always |
| 769 | flushed back into memory at the end of every basic block, so that |
| 770 | the in-memory state is up-to-date between basic blocks. (This |
| 771 | flushing is implied by the statement above that the real |
| 772 | machine's allocatable registers are dead in between simulated |
| 773 | blocks).</para> |
| 774 | |
| 775 | </sect2> |
| 776 | |
| 777 | |
| 778 | |
| 779 | <sect2 id="mc-tech-docs.startup" |
| 780 | xreflabel="Startup, shutdown, and system calls"> |
| 781 | <title>Startup, shutdown, and system calls</title> |
| 782 | |
| 783 | <para>Getting into Valgrind |
| 784 | (<computeroutput>VG_(startup)</computeroutput>, called from |
| 785 | <filename>valgrind.so</filename>'s initialisation section), |
| 786 | really means copying the real CPU's state into |
| 787 | <computeroutput>VG_(baseBlock)</computeroutput>, and then |
| 788 | installing our own stack pointer, etc, into the real CPU, and |
| 789 | then starting up the JITter. Exiting Valgrind involves copying |
| 790 | the simulated state back to the real state.</para> |
| 791 | |
| 792 | <para>Unfortunately, there's a complication at startup time. |
| 793 | The problem is that at the point where we need to take a snapshot of |
| 794 | the real CPU's state, the offsets in |
| 795 | <computeroutput>VG_(baseBlock)</computeroutput> are not set up |
| 796 | yet, because to do so would involve disrupting the real machine's |
| 797 | state significantly. The way round this is to dump the real |
| 798 | machine's state into a temporary, static block of memory, |
| 799 | <computeroutput>VG_(m_state_static)</computeroutput>. We can |
| 800 | then set up the <computeroutput>VG_(baseBlock)</computeroutput> |
| 801 | offsets at our leisure, and copy into it from |
| 802 | <computeroutput>VG_(m_state_static)</computeroutput> at some |
| 803 | convenient later time. This copying is done by |
| 804 | <computeroutput>VG_(copy_m_state_static_to_baseBlock)</computeroutput>.</para> |
| 805 | |
| 806 | <para>On exit, the inverse transformation is (rather |
| 807 | unnecessarily) used: stuff in |
| 808 | <computeroutput>VG_(baseBlock)</computeroutput> is copied to |
| 809 | <computeroutput>VG_(m_state_static)</computeroutput>, and the |
| 810 | assembly stub then copies from |
| 811 | <computeroutput>VG_(m_state_static)</computeroutput> into the |
| 812 | real machine registers.</para> |
| 813 | |
| 814 | <para>Doing system calls on behalf of the client |
| 815 | (<filename>vg_syscall.S</filename>) is something of a half-way |
| 816 | house. We have to make the world look sufficiently like that |
| 817 | which the client would normally have to make the syscall actually |
| 818 | work properly, but we can't afford to lose control. So the trick |
| 819 | is to copy all of the client's state, <command>except its program |
| 820 | counter</command>, into the real CPU, do the system call, and |
| 821 | copy the state back out. Note that the client's state includes |
| 822 | its stack pointer register, so one effect of this partial |
| 823 | restoration is to cause the system call to be run on the client's |
| 824 | stack, as it should be.</para> |
| 825 | |
| 826 | <para>As ever there are complications. We have to save some of |
| 827 | our own state somewhere when restoring the client's state into |
| 828 | the CPU, so that we can keep going sensibly afterwards. In fact |
| 829 | the only thing which is important is our own stack pointer, but |
| 830 | for paranoia reasons I save and restore our own FPU state as |
| 831 | well, even though that's probably pointless.</para> |
| 832 | |
| 833 | <para>The complication on the above complication is that, for |
| 834 | horrible reasons to do with signals, we may have to handle a |
| 835 | second client system call whilst the client is blocked inside |
| 836 | some other system call (unbelievable!). That means there are two |
| 837 | sets of places to dump Valgrind's stack pointer and FPU state |
| 838 | across the syscall, and we decide which to use by consulting |
| 839 | <computeroutput>VG_(syscall_depth)</computeroutput>, which is in |
| 840 | turn maintained by |
| 841 | <computeroutput>VG_(wrap_syscall)</computeroutput>.</para> |
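<para>The two sets of save areas can be pictured as a small array indexed
by nesting depth. The following is a sketch under stated assumptions,
not the code in <filename>vg_syscall.S</filename>: the type and function
names are hypothetical, the variable
<computeroutput>syscall_depth</computeroutput> merely stands in for
<computeroutput>VG_(syscall_depth)</computeroutput>, and the 108-byte FPU
image size is the figure quoted earlier.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

enum { MAX_SYSCALL_DEPTH = 2, FPU_STATE_BYTES = 108 };

/* What Valgrind must remember about itself across a client syscall. */
typedef struct {
    uint32_t saved_esp;                  /* Valgrind's own stack pointer */
    uint8_t  saved_fpu[FPU_STATE_BYTES]; /* Valgrind's own FPU state     */
} SavedHostState;

static SavedHostState save_slots[MAX_SYSCALL_DEPTH];
static int syscall_depth = 0;   /* stand-in for VG_(syscall_depth) */

/* Pick the save area for the current nesting level and record the
   host stack pointer in it before the client state is loaded. */
static SavedHostState *enter_syscall(uint32_t host_esp)
{
    assert(syscall_depth < MAX_SYSCALL_DEPTH);
    SavedHostState *slot = &save_slots[syscall_depth++];
    slot->saved_esp = host_esp;
    return slot;
}

/* After the syscall returns, pop the nesting level and hand back the
   host stack pointer that was saved on entry. */
static uint32_t leave_syscall(void)
{
    assert(syscall_depth > 0);
    return save_slots[--syscall_depth].saved_esp;
}
```
]]></programlisting>
<para>Because the depth is incremented on entry and decremented on exit,
a nested syscall uses the second slot and leaves the outer syscall's
saved values untouched.</para>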
| 842 | |
| 843 | </sect2> |
| 844 | |
| 845 | |
| 846 | |
| 847 | <sect2 id="mc-tech-docs.ucode" xreflabel="Introduction to UCode"> |
| 848 | <title>Introduction to UCode</title> |
| 849 | |
| 850 | <para>UCode lies at the heart of the x86-to-x86 JITter. The |
| 851 | basic premise is that dealing with the x86 instruction set head-on |
| 852 | is just too darn complicated, so we do the traditional |
| 853 | compiler-writer's trick and translate it into a simpler, |
| 854 | easier-to-deal-with form.</para> |
| 855 | |
| 856 | <para>In normal operation, translation proceeds through six |
| 857 | stages, coordinated by |
| 858 | <computeroutput>VG_(translate)</computeroutput>:</para> |
| 859 | |
| 860 | <orderedlist> |
| 861 | <listitem> |
| 862 | <para>Parsing of an x86 basic block into a sequence of UCode |
| 863 | instructions (<computeroutput>VG_(disBB)</computeroutput>).</para> |
| 864 | </listitem> |
| 865 | |
| 866 | <listitem> |
| 867 | <para>UCode optimisation |
| 868 | (<computeroutput>vg_improve</computeroutput>), with the aim |
| 869 | of caching simulated registers in real registers over |
| 870 | multiple simulated instructions, and removing redundant |
| 871 | simulated <computeroutput>%EFLAGS</computeroutput> |
| 872 | saving/restoring.</para> |
| 873 | </listitem> |
| 874 | |
| 875 | <listitem> |
| 876 | <para>UCode instrumentation |
| 877 | (<computeroutput>vg_instrument</computeroutput>), which adds |
| 878 | value and address checking code.</para> |
| 879 | </listitem> |
| 880 | |
| 881 | <listitem> |
| 882 | <para>Post-instrumentation cleanup |
| 883 | (<computeroutput>vg_cleanup</computeroutput>), removing |
| 884 | redundant value-check computations.</para> |
| 885 | </listitem> |
| 886 | |
| 887 | <listitem> |
| 888 | <para>Register allocation |
| 889 | (<computeroutput>vg_do_register_allocation</computeroutput>), |
| 890 | which, note, is done on UCode.</para> |
| 891 | </listitem> |
| 892 | |
| 893 | <listitem> |
| 894 | <para>Emission of final instrumented x86 code |
| 895 | (<computeroutput>VG_(emit_code)</computeroutput>).</para> |
| 896 | </listitem> |
| 897 | |
| 898 | </orderedlist> |
| 899 | |
| 900 | <para>Notice how steps 2, 3, 4 and 5 are simple UCode-to-UCode |
| 901 | transformation passes, all on straight-line blocks of UCode (type |
| 902 | <computeroutput>UCodeBlock</computeroutput>). Steps 2 and 4 are |
| 903 | optimisation passes and can be disabled for debugging purposes, |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 904 | with <option>--optimise=no</option> and |
| 905 | <option>--cleanup=no</option> respectively.</para> |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 906 | |
| 907 | <para>Valgrind can also run in a no-instrumentation mode, given |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 908 | <option>--instrument=no</option>. This is useful |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 909 | for debugging the JITter quickly without having to deal with the |
| 910 | complexity of the instrumentation mechanism too. In this mode, |
| 911 | steps 3 and 4 are omitted.</para> |
| 912 | |
| 913 | <para>These flags combine, so that |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 914 | <option>--instrument=no</option> together with |
| 915 | <option>--optimise=no</option> means only steps |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 916 | 1, 5 and 6 are used. |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 917 | <option>--single-step=yes</option> causes each |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 918 | x86 instruction to be treated as a single basic block. The |
| 919 | translations are terrible but this is sometimes instructive.</para> |
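<para>How the flags select stages can be summarised in a few lines of C.
This is only a model of the rules stated above, with hypothetical names
(<computeroutput>JitFlags</computeroutput>,
<computeroutput>stages_to_run</computeroutput>,
<computeroutput>runs_step</computeroutput>); it is not taken from
<computeroutput>VG_(translate)</computeroutput>.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>

/* One field per command-line flag discussed in the text. */
typedef struct {
    bool optimise;    /* --optimise=yes|no   */
    bool cleanup;     /* --cleanup=yes|no    */
    bool instrument;  /* --instrument=yes|no */
} JitFlags;

/* Bit (n - 1) stands for step n of the six-stage list. */
enum {
    STAGE_PARSE      = 1 << 0,
    STAGE_IMPROVE    = 1 << 1,
    STAGE_INSTRUMENT = 1 << 2,
    STAGE_CLEANUP    = 1 << 3,
    STAGE_REGALLOC   = 1 << 4,
    STAGE_EMIT       = 1 << 5
};

/* Steps 1, 5 and 6 always run. Step 2 depends on --optimise. Step 3
   depends on --instrument. Step 4 needs both instrumentation and
   --cleanup, since without instrumentation there is nothing to clean. */
static unsigned stages_to_run(JitFlags f)
{
    unsigned stages = STAGE_PARSE | STAGE_REGALLOC | STAGE_EMIT;
    if (f.optimise)
        stages |= STAGE_IMPROVE;
    if (f.instrument) {
        stages |= STAGE_INSTRUMENT;
        if (f.cleanup)
            stages |= STAGE_CLEANUP;
    }
    return stages;
}

/* True if step `step` (1 to 6) runs under the given flags. */
static bool runs_step(JitFlags f, int step)
{
    assert(step >= 1 && step <= 6);
    return ((stages_to_run(f) >> (step - 1)) & 1u) != 0;
}
```
]]></programlisting>
<para>For instance, with instrumentation and optimisation both off, the
model reports that only steps 1, 5 and 6 run, matching the combination
described above.</para>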
| 920 | |
de | 03e0e7c | 2005-12-03 23:02:33 +0000 | [diff] [blame] | 921 | <para>The <option>--stop-after=N</option> flag |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 922 | switches back to the real CPU after |
| 923 | <computeroutput>N</computeroutput> basic blocks. It also re-JITs |
| 924 | the final basic block executed and prints the debugging info |
| 925 | resulting, so this gives you a way to get a quick snapshot of how |
| 926 | a basic block looks as it passes through the six stages mentioned |
| 927 | above. If you want to see full information for every block |
| 928 | translated (probably not, but still ...) find, in |
| 929 | <computeroutput>VG_(translate)</computeroutput>, the lines</para> |
| 930 | <programlisting><![CDATA[ |
| 931 | dis = True; |
| 932 | dis = debugging_translation;]]></programlisting> |
| 933 | |
| 934 | <para>and comment out the second line. This will spew out |
| 935 | debugging junk faster than you can possibly imagine.</para> |
| 936 | |
| 937 | </sect2> |
| 938 | |
| 939 | |
| 940 | |
| 941 | <sect2 id="mc-tech-docs.tags" xreflabel="UCode operand tags: type 'Tag'"> |
| 942 | <title>UCode operand tags: type <computeroutput>Tag</computeroutput></title> |
| 943 | |
| 944 | <para>UCode is, more or less, a simple two-address RISC-like |
| 945 | code. In keeping with the x86 AT&amp;T assembly syntax, |
| 946 | generally speaking the first operand is the source operand, and |
| 947 | the second is the destination operand, which is modified when the |
| 948 | uinstr is notionally executed.</para> |
| 949 | |
| 950 | <para>UCode instructions have up to three operand fields, each of |
| 951 | which has a corresponding <computeroutput>Tag</computeroutput> |
| 952 | describing it. Possible values for the tag are:</para> |
| 953 | |
| 954 | <itemizedlist> |
| 955 | |
| 956 | <listitem> |
| 957 | <para><computeroutput>NoValue</computeroutput>: indicates |
| 958 | that the field is not in use.</para> |
| 959 | </listitem> |
| 960 | |
| 961 | <listitem> |
| 962 | <para><computeroutput>Lit16</computeroutput>: the field |
| 963 | contains a 16-bit literal.</para> |
| 964 | </listitem> |
| 965 | |
| 966 | <listitem> |
| 967 | <para><computeroutput>Literal</computeroutput>: the field |
| 968 | denotes a 32-bit literal, whose value is stored in the |
| 969 | <computeroutput>lit32</computeroutput> field of the uinstr |
| 970 | itself. Since there is only one |
| 971 | <computeroutput>lit32</computeroutput> for the whole uinstr, |
| 972 | only one operand field may contain this tag.</para> |
| 973 | </listitem> |
| 974 | |
| 975 | <listitem> |
| 976 | <para><computeroutput>SpillNo</computeroutput>: the field |
| 977 | contains a spill slot number, in the range 0 to 23 inclusive, |
| 978 | denoting one of the spill slots contained inside |
| 979 | <computeroutput>VG_(baseBlock)</computeroutput>. Such tags |
| 980 | only exist after register allocation.</para> |
| 981 | </listitem> |
| 982 | |
| 983 | <listitem> |
| 984 | <para><computeroutput>RealReg</computeroutput>: the field |
| 985 | contains a number in the range 0 to 7 denoting an integer x86 |
| 986 | ("real") register on the host. The number is the Intel |
| 987 | encoding for integer registers. Such tags only exist after |
| 988 | register allocation.</para> |
| 989 | </listitem> |
| 990 | |
| 991 | <listitem> |
| 992 | <para><computeroutput>ArchReg</computeroutput>: the field |
| 993 | contains a number in the range 0 to 7 denoting an integer x86 |
| 994 | register on the simulated CPU. In reality this means a |
| 995 | reference to one of the first 8 words of |
| 996 | <computeroutput>VG_(baseBlock)</computeroutput>. Such tags |
| 997 | can exist at any point in the translation process.</para> |
| 998 | </listitem> |
| 999 | |
| 1000 | <listitem> |
| 1001 | <para>Last, but not least, |
| 1002 | <computeroutput>TempReg</computeroutput>. The field contains |
| 1003 | the number of one of an infinite set of virtual (integer) |
| 1004 | registers. <computeroutput>TempReg</computeroutput>s are used |
| 1005 | everywhere throughout the translation process; you can have |
| 1006 | as many as you want. The register allocator maps as many as |
| 1007 | it can into <computeroutput>RealReg</computeroutput>s and |
| 1008 | turns the rest into |
| 1009 | <computeroutput>SpillNo</computeroutput>s, so |
| 1010 | <computeroutput>TempReg</computeroutput>s should not exist |
| 1011 | after the register allocation phase.</para> |
| 1012 | |
| 1013 | <para><computeroutput>TempReg</computeroutput>s are always 32 |
| 1014 | bits long, even if the data they hold is logically shorter. |
| 1015 | In that case the upper unused bits are required, and, I |
| 1016 | think, generally assumed, to be zero. |
| 1017 | <computeroutput>TempReg</computeroutput>s holding V bits for |
| 1018 | quantities shorter than 32 bits are expected to have ones in |
| 1019 | the unused places, since a one denotes "undefined".</para> |
| 1020 | </listitem> |
| 1021 | |
| 1022 | </itemizedlist> |
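<para>The padding convention for short quantities held in a
<computeroutput>TempReg</computeroutput> can be illustrated as follows.
The helper names are hypothetical; the sketch only encodes the two rules
just stated: unused upper bits of a data value are zero, and unused
upper bits of a V-bit word are ones ("undefined").</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Mask covering the low `size` bytes of a 32-bit word (size 1, 2 or 4). */
static uint32_t low_bytes_mask(int size)
{
    assert(size == 1 || size == 2 || size == 4);
    if (size == 4)
        return 0xFFFFFFFFu;
    return (1u << (8 * size)) - 1u;
}

/* Data held in a TempReg: bits above the logical size are zero. */
static uint32_t pad_value(uint32_t value, int size)
{
    return value & low_bytes_mask(size);
}

/* V bits held in a TempReg: bits above the logical size are one,
   because a one bit means "undefined". */
static uint32_t pad_vbits(uint32_t vbits, int size)
{
    uint32_t used = low_bytes_mask(size);
    return (vbits & used) | ~used;
}
```
]]></programlisting>
<para>So the V bits for a fully defined byte (all zeros) are carried in a
<computeroutput>TempReg</computeroutput> as
<computeroutput>0xFFFFFF00</computeroutput>, while the byte's data value
is carried with its upper 24 bits clear.</para>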
| 1023 | |
| 1024 | </sect2> |
| 1025 | |
| 1026 | |
| 1027 | |
| 1028 | <sect2 id="mc-tech-docs.uinstr" |
| 1029 | xreflabel="UCode instructions: type 'UInstr'"> |
| 1030 | <title>UCode instructions: type <computeroutput>UInstr</computeroutput></title> |
| 1031 | |
| 1032 | <para>UCode was carefully designed to make it possible to do |
| 1033 | register allocation on UCode and then translate the result into |
| 1034 | x86 code without needing any extra registers ... well, that was |
| 1035 | the original plan, anyway. Things have gotten a little more |
| 1036 | complicated since then. In what follows, UCode instructions are |
| 1037 | referred to as uinstrs, to distinguish them from x86 |
| 1038 | instructions. Uinstrs of course have uopcodes which are |
| 1039 | (naturally) different from x86 opcodes.</para> |
| 1040 | |
| 1041 | <para>A uinstr (type <computeroutput>UInstr</computeroutput>) |
| 1042 | contains various fields, not all of which are used by any one |
| 1043 | uopcode:</para> |
| 1044 | |
| 1045 | <itemizedlist> |
| 1046 | |
| 1047 | <listitem> |
| 1048 | <para>Three 16-bit operand fields, |
| 1049 | <computeroutput>val1</computeroutput>, |
| 1050 | <computeroutput>val2</computeroutput> and |
| 1051 | <computeroutput>val3</computeroutput>.</para> |
| 1052 | </listitem> |
| 1053 | |
| 1054 | <listitem> |
| 1055 | <para>Three tag fields, |
| 1056 | <computeroutput>tag1</computeroutput>, |
| 1057 | <computeroutput>tag2</computeroutput> and |
| 1058 | <computeroutput>tag3</computeroutput>. Each of these has a |
| 1059 | value of type <computeroutput>Tag</computeroutput>, and they |
| 1060 | describe what the <computeroutput>val1</computeroutput>, |
| 1061 | <computeroutput>val2</computeroutput> and |
| 1062 | <computeroutput>val3</computeroutput> fields contain.</para> |
| 1063 | </listitem> |
| 1064 | |
| 1065 | <listitem> |
| 1066 | <para>A 32-bit literal field.</para> |
| 1067 | </listitem> |
| 1068 | |
| 1069 | <listitem> |
| 1070 | <para>Two <computeroutput>FlagSet</computeroutput>s, |
| 1071 | specifying which x86 condition codes are read and written by |
| 1072 | the uinstr.</para> |
| 1073 | </listitem> |
| 1074 | |
| 1075 | <listitem> |
| 1076 | <para>An opcode byte, containing a value of type |
| 1077 | <computeroutput>Opcode</computeroutput>.</para> |
| 1078 | </listitem> |
| 1079 | |
| 1080 | <listitem> |
| 1081 | <para>A size field, indicating the data transfer size |
| 1082 | (1/2/4/8/10) in cases where this makes sense, or zero |
| 1083 | otherwise.</para> |
| 1084 | </listitem> |
| 1085 | |
| 1086 | <listitem> |
| 1087 | <para>A condition-code field, which, for jumps, holds a value |
| 1088 | of type <computeroutput>Condcode</computeroutput>, indicating |
| 1089 | the condition which applies. The encoding is as it is in the |
| 1090 | x86 insn stream, except we add a 17th value |
| 1091 | <computeroutput>CondAlways</computeroutput> to indicate an |
| 1092 | unconditional transfer.</para> |
| 1093 | </listitem> |
| 1094 | |
| 1095 | <listitem> |
| 1096 | <para>Various 1-bit flags, indicating whether this insn |
| 1097 | pertains to an x86 CALL or RET instruction, whether a |
| 1098 | widening is signed or not, etc.</para> |
| 1099 | </listitem> |
| 1100 | |
| 1101 | </itemizedlist> |
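<para>Put together, the fields listed above suggest a record along the
following lines. This is a hedged sketch, not the definition in the
sources: the field names other than
<computeroutput>val1</computeroutput> to
<computeroutput>val3</computeroutput>,
<computeroutput>tag1</computeroutput> to
<computeroutput>tag3</computeroutput> and
<computeroutput>lit32</computeroutput> are invented, as are the field
widths and ordering and the function
<computeroutput>at_most_one_literal</computeroutput>.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Operand tags, as listed in the previous section. */
typedef enum {
    NoValue, Lit16, Literal, SpillNo, RealReg, ArchReg, TempReg
} Tag;

/* Assumed layout; widths and order are illustrative only. */
typedef struct {
    uint32_t lit32;             /* the single 32-bit literal          */
    uint16_t val1, val2, val3;  /* three 16-bit operand fields        */
    uint8_t  tag1, tag2, tag3;  /* a Tag describing each operand      */
    uint8_t  opcode;            /* a value of type Opcode             */
    uint8_t  size;              /* 1/2/4/8/10, or 0 if not meaningful */
    uint8_t  cond;              /* a Condcode, for jumps              */
    uint8_t  flags_read;        /* FlagSet read by the uinstr         */
    uint8_t  flags_written;     /* FlagSet written by the uinstr      */
    unsigned is_call      : 1;  /* pertains to an x86 CALL            */
    unsigned is_ret       : 1;  /* pertains to an x86 RET             */
    unsigned signed_widen : 1;  /* widening is signed                 */
} UInstr;

/* Number of operand fields tagged Literal. */
static int literal_tag_count(const UInstr *u)
{
    assert(u != NULL);
    return (u->tag1 == Literal) + (u->tag2 == Literal) + (u->tag3 == Literal);
}

/* There is only one lit32 per uinstr, so at most one operand may
   carry the Literal tag. */
static bool at_most_one_literal(const UInstr *u)
{
    return literal_tag_count(u) <= 1;
}
```
]]></programlisting>
<para>A check of this kind is one small part of what a sanity checker for
uinstrs has to enforce.</para>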
| 1102 | |
| 1103 | <para>UOpcodes (type <computeroutput>Opcode</computeroutput>) are |
| 1104 | divided into two groups: those necessary merely to express the |
| 1105 | functionality of the x86 code, and extra uopcodes needed to |
| 1106 | express the instrumentation. The former group contains:</para> |
| 1107 | |
| 1108 | <itemizedlist> |
| 1109 | |
| 1110 | <listitem> |
| 1111 | <para><computeroutput>GET</computeroutput> and |
| 1112 | <computeroutput>PUT</computeroutput>, which move values from |
| 1113 | the simulated CPU's integer registers |
| 1114 | (<computeroutput>ArchReg</computeroutput>s) into |
| 1115 | <computeroutput>TempReg</computeroutput>s, and back. |
| 1116 | <computeroutput>GETF</computeroutput> and |
| 1117 | <computeroutput>PUTF</computeroutput> do the corresponding |
| 1118 | thing for the simulated |
| 1119 | <computeroutput>%EFLAGS</computeroutput>. There are no |
| 1120 | corresponding insns for the FPU register stack, since we |
| 1121 | don't explicitly simulate its registers.</para> |
| 1122 | </listitem> |
| 1123 | |
| 1124 | <listitem> |
| 1125 | <para><computeroutput>LOAD</computeroutput> and |
| 1126 | <computeroutput>STORE</computeroutput>, which, in RISC-like |
| 1127 | fashion, are the only uinstrs able to interact with |
| 1128 | memory.</para> |
| 1129 | </listitem> |
| 1130 | |
| 1131 | <listitem> |
| 1132 | <para><computeroutput>MOV</computeroutput> and |
| 1133 | <computeroutput>CMOV</computeroutput> allow unconditional and |
| 1134 | conditional moves of values between |
| 1135 | <computeroutput>TempReg</computeroutput>s.</para> |
| 1136 | </listitem> |
| 1137 | |
| 1138 | <listitem> |
| 1139 | <para>ALU operations. Again in RISC-like fashion, these only |
| 1140 | operate on <computeroutput>TempReg</computeroutput>s (before |
| 1141 | reg-alloc) or <computeroutput>RealReg</computeroutput>s |
| 1142 | (after reg-alloc). These are: |
| 1143 | <computeroutput>ADD</computeroutput>, |
| 1144 | <computeroutput>ADC</computeroutput>, |
| 1145 | <computeroutput>AND</computeroutput>, |
| 1146 | <computeroutput>OR</computeroutput>, |
| 1147 | <computeroutput>XOR</computeroutput>, |
| 1148 | <computeroutput>SUB</computeroutput>, |
| 1149 | <computeroutput>SBB</computeroutput>, |
| 1150 | <computeroutput>SHL</computeroutput>, |
| 1151 | <computeroutput>SHR</computeroutput>, |
| 1152 | <computeroutput>SAR</computeroutput>, |
| 1153 | <computeroutput>ROL</computeroutput>, |
| 1154 | <computeroutput>ROR</computeroutput>, |
| 1155 | <computeroutput>RCL</computeroutput>, |
| 1156 | <computeroutput>RCR</computeroutput>, |
| 1157 | <computeroutput>NOT</computeroutput>, |
| 1158 | <computeroutput>NEG</computeroutput>, |
| 1159 | <computeroutput>INC</computeroutput>, |
| 1160 | <computeroutput>DEC</computeroutput>, |
| 1161 | <computeroutput>BSWAP</computeroutput>, |
| 1162 | <computeroutput>CC2VAL</computeroutput> and |
| 1163 | <computeroutput>WIDEN</computeroutput>. |
| 1164 | <computeroutput>WIDEN</computeroutput> does signed or |
| 1165 | unsigned value widening. |
| 1166 | <computeroutput>CC2VAL</computeroutput> is used to convert |
| 1167 | condition codes into a value, zero or one. The rest are |
| 1168 | obvious.</para> |
| 1169 | |
| 1170 | <para>To allow for more efficient code generation, we bend |
| 1171 | slightly the restriction at the start of the previous para: |
| 1172 | for <computeroutput>ADD</computeroutput>, |
| 1173 | <computeroutput>ADC</computeroutput>, |
| 1174 | <computeroutput>XOR</computeroutput>, |
| 1175 | <computeroutput>SUB</computeroutput> and |
| 1176 | <computeroutput>SBB</computeroutput>, we allow the first |
| 1177 | (source) operand to also be an |
| 1178 | <computeroutput>ArchReg</computeroutput>, that is, one of the |
| 1179 | simulated machine's registers. Also, many of these ALU ops |
| 1180 | allow the source operand to be a literal. See |
| 1181 | <computeroutput>VG_(saneUInstr)</computeroutput> for the |
| 1182 | final word on the allowable forms of uinstrs.</para> |
| 1183 | </listitem> |
| 1184 | |
| 1185 | <listitem> |
| 1186 | <para><computeroutput>LEA1</computeroutput> and |
| 1187 | <computeroutput>LEA2</computeroutput> are not strictly |
sewardj | 3387889 | 2007-11-17 09:43:25 +0000 | [diff] [blame^] | 1188 | necessary, but facilitate better translations. They |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 1189 | record the fancy x86 addressing modes in a direct way, which |
| 1190 | allows those amodes to be emitted back into the final |
| 1191 | instruction stream more or less verbatim.</para> |
| 1192 | </listitem> |
| 1193 | |
| 1194 | <listitem> |
| 1195 | <para><computeroutput>CALLM</computeroutput> calls a |
| 1196 | machine-code helper, one of the methods whose address is |
| 1197 | stored at some |
| 1198 | <computeroutput>VG_(baseBlock)</computeroutput> offset. |
| 1199 | <computeroutput>PUSH</computeroutput> and |
| 1200 | <computeroutput>POP</computeroutput> move values between |
| 1201 | <computeroutput>TempReg</computeroutput>s and the real |
| 1202 | (Valgrind's) stack, and |
| 1203 | <computeroutput>CLEAR</computeroutput> removes values from |
| 1204 | the stack. <computeroutput>CALLM_S</computeroutput> and |
| 1205 | <computeroutput>CALLM_E</computeroutput> delimit the |
| 1206 | boundaries of call setups and clearings, for the benefit of |
| 1207 | the instrumentation passes. Getting this right is critical, |
| 1208 | and so <computeroutput>VG_(saneUCodeBlock)</computeroutput> |
| 1209 | makes various checks on the use of these uopcodes.</para> |
| 1210 | |
| 1211 | <para>It is important to understand that these uopcodes have |
| 1212 | nothing to do with the x86 |
| 1213 | <computeroutput>call</computeroutput>, |
| 1214 | <computeroutput>ret</computeroutput>, |
| 1215 | <computeroutput>push</computeroutput> or |
| 1216 | <computeroutput>pop</computeroutput> instructions, and are |
| 1217 | not used to implement them. Those guys turn into |
| 1218 | combinations of <computeroutput>GET</computeroutput>, |
| 1219 | <computeroutput>PUT</computeroutput>, |
| 1220 | <computeroutput>LOAD</computeroutput>, |
| 1221 | <computeroutput>STORE</computeroutput>, |
| 1222 | <computeroutput>ADD</computeroutput>, |
| 1223 | <computeroutput>SUB</computeroutput>, and |
| 1224 | <computeroutput>JMP</computeroutput>. What these uopcodes |
| 1225 | support is calling of helper functions such as |
| 1226 | <computeroutput>VG_(helper_imul_32_64)</computeroutput>, |
| 1227 | which do stuff which is too difficult or tedious to emit |
| 1228 | inline.</para> |
| 1229 | </listitem> |
| 1230 | |
| 1231 | <listitem> |
| 1232 | <para><computeroutput>FPU</computeroutput>, |
| 1233 | <computeroutput>FPU_R</computeroutput> and |
| 1234 | <computeroutput>FPU_W</computeroutput>. Valgrind doesn't |
| 1235 | attempt to simulate the internal state of the FPU at all. |
| 1236 | Consequently it only needs to be able to distinguish FPU ops |
| 1237 | which read and write memory from those that don't, and for |
| 1238 | those which do, it needs to know the effective address and |
| 1239 | data transfer size. This is made easier because the x86 FP |
| 1240 | instruction encoding is very regular, basically consisting of |
| 1241 | 16 bits for a non-memory FPU insn and 11 (IIRC) bits + an |
| 1242 | address mode for a memory FPU insn. So our |
| 1243 | <computeroutput>FPU</computeroutput> uinstr carries the 16 |
| 1244 | bits in its <computeroutput>val1</computeroutput> field. And |
| 1245 | <computeroutput>FPU_R</computeroutput> and |
| 1246 | <computeroutput>FPU_W</computeroutput> carry 11 bits in that |
| 1247 | field, together with the identity of a |
| 1248 | <computeroutput>TempReg</computeroutput> or (later) |
| 1249 | <computeroutput>RealReg</computeroutput> which contains the |
| 1250 | address.</para> |
| 1251 | </listitem> |
| 1252 | |
| 1253 | <listitem> |
| 1254 | <para><computeroutput>JIFZ</computeroutput> is unique, in |
| 1255 | that it allows a control-flow transfer which is not deemed to |
| 1256 | end a basic block. It causes a jump to a literal (original) |
| 1257 | address if the specified argument is zero.</para> |
| 1258 | </listitem> |
| 1259 | |
| 1260 | <listitem> |
| 1261 | <para>Finally, <computeroutput>INCEIP</computeroutput> |
| 1262 | advances the simulated <computeroutput>%EIP</computeroutput> |
| 1263 | by the specified literal amount. This supports lazy |
| 1264 | <computeroutput>%EIP</computeroutput> updating, as described |
| 1265 | below.</para> |
| 1266 | </listitem> |
| 1267 | |
| 1268 | </itemizedlist> |
| 1269 | |
| 1270 | <para>Stages 1 and 2 of the 6-stage translation process mentioned |
| 1271 | above deal purely with these uopcodes, and no others. They are |
| 1272 | sufficient to express pretty much all the x86 32-bit |
| 1273 | protected-mode instruction set, at least everything understood by |
| 1274 | a pre-MMX original Pentium (P54C).</para> |
| 1275 | |
| 1276 | <para>Stages 3, 4, 5 and 6 also deal with the following extra |
| 1277 | "instrumentation" uopcodes. They are used to express all the |
| 1278 | definedness-tracking and -checking machinery which valgrind does. |
| 1279 | In later sections we show how to create checking code for each of |
| 1280 | the uopcodes above. Note that these instrumentation uopcodes, |
| 1281 | although some appear complicated, have been carefully chosen |
| 1282 | so that efficient x86 code can be generated for them. GNU |
| 1283 | superopt v2.5 did a great job helping out here. Anyways, the |
| 1284 | uopcodes are as follows:</para> |
| 1285 | |
| 1286 | <itemizedlist> |
| 1287 | |
| 1288 | <listitem> |
| 1289 | <para><computeroutput>GETV</computeroutput> and |
| 1290 | <computeroutput>PUTV</computeroutput> are analogues of |
| 1291 | <computeroutput>GET</computeroutput> and |
| 1292 | <computeroutput>PUT</computeroutput> above. They are |
| 1293 | identical except that they move the V bits for the specified |
| 1294 | values back and forth to |
| 1295 | <computeroutput>TempRegs</computeroutput>, rather than moving |
| 1296 | the values themselves.</para> |
| 1297 | </listitem> |
| 1298 | |
| 1299 | <listitem> |
| 1300 | <para>Similarly, <computeroutput>LOADV</computeroutput> and |
| 1301 | <computeroutput>STOREV</computeroutput> read and write V bits |
| 1302 | from the synthesised shadow memory that Valgrind maintains. |
| 1303 | In fact they do more than that, since they also do |
| 1304 | address-validity checks, and emit complaints if the |
sewardj | 3387889 | 2007-11-17 09:43:25 +0000 | [diff] [blame^] | 1305 | read/written addresses are unaddressable.</para> |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 1306 | </listitem> |
| 1307 | |
| 1308 | <listitem> |
| 1309 | <para><computeroutput>TESTV</computeroutput>, whose |
| 1310 | parameters are a <computeroutput>TempReg</computeroutput> and |
| 1311 | a size, tests the V bits in the |
| 1312 | <computeroutput>TempReg</computeroutput>, at the specified |
| 1313 | operation size (0/1/2/4 byte) and emits an error if any of |
| 1314 | them indicate undefinedness. This is the only uopcode |
| 1315 | capable of doing such tests.</para> |
| 1316 | </listitem> |
| 1317 | |
| 1318 | <listitem> |
| 1319 | <para><computeroutput>SETV</computeroutput>, whose parameters |
| 1320 | are also a <computeroutput>TempReg</computeroutput> and a size, |
| 1321 | makes the V bits in the |
| 1322 | <computeroutput>TempReg</computeroutput> indicate |
| 1323 | definedness, at the specified operation size. This is |
| 1324 | usually used to generate the correct V bits for a literal |
| 1325 | value, which is of course fully defined.</para> |
| 1326 | </listitem> |
| 1327 | |
| 1328 | <listitem> |
| 1329 | <para><computeroutput>GETVF</computeroutput> and |
| 1330 | <computeroutput>PUTVF</computeroutput> are analogues of |
| 1331 | <computeroutput>GETF</computeroutput> and |
| 1332 | <computeroutput>PUTF</computeroutput>. They move the single |
| 1333 | V bit used to model definedness of |
| 1334 | <computeroutput>%EFLAGS</computeroutput> between its home in |
| 1335 | <computeroutput>VG_(baseBlock)</computeroutput> and the |
| 1336 | specified <computeroutput>TempReg</computeroutput>.</para> |
| 1337 | </listitem> |
| 1338 | |
| 1339 | <listitem> |
| 1340 | <para><computeroutput>TAG1</computeroutput> denotes one of a |
| 1341 | family of unary operations on |
| 1342 | <computeroutput>TempReg</computeroutput>s containing V bits. |
| 1343 | Similarly, <computeroutput>TAG2</computeroutput> denotes one |
| 1344 | in a family of binary operations on V bits.</para> |
| 1345 | </listitem> |
| 1346 | |
| 1347 | </itemizedlist> |
| 1348 | |
| 1349 | |
| 1350 | <para>These 10 uopcodes are sufficient to express Valgrind's |
| 1351 | entire definedness-checking semantics. In fact most of the |
| 1352 | interesting magic is done by the |
| 1353 | <computeroutput>TAG1</computeroutput> and |
| 1354 | <computeroutput>TAG2</computeroutput> suboperations.</para> |
| 1355 | |
| 1356 | <para>First, however, I need to explain about V-vector operation |
| 1357 | sizes. There are 4 sizes: 1, 2 and 4, which operate on groups of |
| 1358 | 8, 16 and 32 V bits at a time, supporting the usual 1, 2 and 4 |
| 1359 | byte x86 operations. However there is also the mysterious size |
| 1360 | 0, which really means a single V bit. Single V bits are used in |
| 1361 | various circumstances; in particular, the definedness of |
| 1362 | <computeroutput>%EFLAGS</computeroutput> is modelled with a |
| 1363 | single V bit. Now might be a good time to also point out that |
| 1364 | for V bits, 1 means "undefined" and 0 means "defined". |
| 1365 | Similarly, for A bits, 1 means "invalid address" and 0 means |
| 1366 | "valid address". This seems counterintuitive (and so it is), but |
| 1367 | testing against zero on x86s saves instructions compared to |
| 1368 | testing against all 1s, because many ALU operations set the Z |
| 1369 | flag for free, so to speak.</para> |
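
<para>To make the encoding concrete, here is a minimal C sketch of
<computeroutput>TESTV</computeroutput>-like and
<computeroutput>SETV</computeroutput>-like operations on a 32-bit
V-bit vector. It is an illustration only, not Valgrind's code: the
function names are made up, and keeping the size-0 V bit in bit 0 is
an assumption of the sketch. How the real implementation treats bits
above the operation size is not covered here.</para>
<programlisting><![CDATA[
#include <stdbool.h>
#include <stdint.h>

/* Mask selecting the V bits covered by an operation size of 0, 1, 2
   or 4.  Size 0 is the single-V-bit case; storing that bit in bit 0
   is an assumption of this sketch. */
uint32_t vbits_mask(int size)
{
    switch (size) {
    case 0:  return 0x00000001u;
    case 1:  return 0x000000FFu;
    case 2:  return 0x0000FFFFu;
    default: return 0xFFFFFFFFu;   /* size 4 */
    }
}

/* TESTV-like check: true if any covered V bit says "undefined" (1). */
bool any_undefined(uint32_t vbits, int size)
{
    return (vbits & vbits_mask(size)) != 0;
}

/* SETV-like operation: mark the covered V bits as defined (0). */
uint32_t set_defined(uint32_t vbits, int size)
{
    return vbits & ~vbits_mask(size);
}]]></programlisting>

<para>Because 0 means "defined", the check boils down to a mask
followed by a comparison against zero, which is the kind of test the
paragraph above says is cheap on x86.</para>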
| 1370 | |
| 1371 | <para>With that in mind, the tag ops are:</para> |
| 1372 | |
| 1373 | <itemizedlist> |
| 1374 | |
| 1375 | <listitem> |
| 1376 | <formalpara> |
| 1377 | <title>(UNARY) Pessimising casts:</title> |
| 1378 | <para><computeroutput>VgT_PCast40</computeroutput>, |
| 1379 | <computeroutput>VgT_PCast20</computeroutput>, |
| 1380 | <computeroutput>VgT_PCast10</computeroutput>, |
| 1381 | <computeroutput>VgT_PCast01</computeroutput>, |
| 1382 | <computeroutput>VgT_PCast02</computeroutput> and |
| 1383 | <computeroutput>VgT_PCast04</computeroutput>. A "pessimising |
| 1384 | cast" takes a V-bit vector at one size, and creates a new one |
| 1385 | at another size, pessimised in the sense that if any of the |
| 1386 | bits in the source vector indicate undefinedness, then all |
| 1387 | the bits in the result indicate undefinedness. In this case |
| 1388 | the casts are all to or from a single V bit, so for example |
| 1389 | <computeroutput>VgT_PCast40</computeroutput> is a pessimising |
| 1390 | cast from 32 bits to 1, whereas |
| 1391 | <computeroutput>VgT_PCast04</computeroutput> simply copies |
| 1392 | the single source V bit into all 32 bit positions in the |
| 1393 | result. Surprisingly, these ops can all be implemented very |
| 1394 | efficiently.</para> |
| 1395 | </formalpara> |
| 1396 | |
| 1397 | <para>There are also the pessimising casts |
| 1398 | <computeroutput>VgT_PCast14</computeroutput>, from 8 bits to |
| 1399 | 32, <computeroutput>VgT_PCast12</computeroutput>, from 8 bits |
| 1400 | to 16, and <computeroutput>VgT_PCast11</computeroutput>, from |
| 1401 | 8 bits to 8. This last one seems nonsensical, but in fact it |
| 1402 | isn't a no-op because, as mentioned above, any undefined (1) |
| 1403 | bits in the source infect the entire result.</para> |
| 1404 | </listitem> |
| 1405 | |
| 1406 | <listitem> |
| 1407 | <formalpara> |
| 1408 | <title>(UNARY) Propagating undefinedness upwards in a |
| 1409 | word:</title> |
| 1410 | <para><computeroutput>VgT_Left4</computeroutput>, |
| 1411 | <computeroutput>VgT_Left2</computeroutput> and |
| 1412 | <computeroutput>VgT_Left1</computeroutput>. These are used |
| 1413 | to simulate the worst-case effects of carry propagation in |
| 1414 | adds and subtracts. They return a V vector identical to the |
| 1415 | original, except that if the original contained any undefined |
| 1416 | bits, then all bits above the lowest undefined bit are marked |
| 1417 | as undefined too. Hence the Left bit in the names.</para></formalpara> |
| 1418 | </listitem> |
| 1419 | |
| 1420 | <listitem> |
| 1421 | <formalpara> |
| 1422 | <title>(UNARY) Signed and unsigned value widening:</title> |
| 1423 | <para><computeroutput>VgT_SWiden14</computeroutput>, |
| 1424 | <computeroutput>VgT_SWiden24</computeroutput>, |
| 1425 | <computeroutput>VgT_SWiden12</computeroutput>, |
| 1426 | <computeroutput>VgT_ZWiden14</computeroutput>, |
| 1427 | <computeroutput>VgT_ZWiden24</computeroutput> and |
| 1428 | <computeroutput>VgT_ZWiden12</computeroutput>. These mimic |
| 1429 | the definedness effects of standard signed and unsigned |
| 1430 | integer widening. Unsigned widening creates zero bits in the |
| 1431 | new positions, so |
| 1432 | <computeroutput>VgT_ZWiden*</computeroutput> accordingly |
| 1433 | mark those parts of their result as defined. Signed |
| 1434 | widening copies the sign bit into the new positions, so |
| 1435 | <computeroutput>VgT_SWiden*</computeroutput> copies the |
| 1436 | definedness of the sign bit into the new positions. Because |
| 1437 | 1 means undefined and 0 means defined, these operations can |
| 1438 | (fascinatingly) be done by the same operations which they |
| 1439 | mimic. Go figure.</para> |
| 1440 | </formalpara> |
| 1441 | </listitem> |
| 1442 | |
| 1443 | <listitem> |
| 1444 | <formalpara> |
| 1445 | <title>(BINARY) Undefined-if-either-Undefined, |
| 1446 | Defined-if-either-Defined:</title> |
| 1447 | <para><computeroutput>VgT_UifU4</computeroutput>, |
| 1448 | <computeroutput>VgT_UifU2</computeroutput>, |
| 1449 | <computeroutput>VgT_UifU1</computeroutput>, |
| 1450 | <computeroutput>VgT_UifU0</computeroutput>, |
| 1451 | <computeroutput>VgT_DifD4</computeroutput>, |
| 1452 | <computeroutput>VgT_DifD2</computeroutput>, |
| 1453 | <computeroutput>VgT_DifD1</computeroutput>. These do simple |
| 1454 | bitwise operations on pairs of V-bit vectors, with |
| 1455 | <computeroutput>UifU</computeroutput> giving undefined if |
| 1456 | either arg bit is undefined, and |
| 1457 | <computeroutput>DifD</computeroutput> giving defined if |
| 1458 | either arg bit is defined. Abstract interpretation junkies, |
| 1459 | if any make it this far, may like to think of them as meets |
| 1460 | and joins (or is it joins and meets) in the definedness |
| 1461 | lattices.</para> |
| 1462 | </formalpara> |
| 1463 | </listitem> |
| 1464 | |
| 1465 | <listitem> |
| 1466 | <formalpara> |
| 1467 | <title>(BINARY; one value, one V bits) Generate argument |
| 1468 | improvement terms for AND and OR</title> |
| 1469 | <para><computeroutput>VgT_ImproveAND4_TQ</computeroutput>, |
| 1470 | <computeroutput>VgT_ImproveAND2_TQ</computeroutput>, |
| 1471 | <computeroutput>VgT_ImproveAND1_TQ</computeroutput>, |
| 1472 | <computeroutput>VgT_ImproveOR4_TQ</computeroutput>, |
| 1473 | <computeroutput>VgT_ImproveOR2_TQ</computeroutput>, |
| 1474 | <computeroutput>VgT_ImproveOR1_TQ</computeroutput>. These |
| 1475 | help out with AND and OR operations. AND and OR have the |
| 1476 | inconvenient property that the definedness of the result |
| 1477 | depends on the actual values of the arguments as well as |
| 1478 | their definedness. At the bit level:</para></formalpara> |
| 1479 | <programlisting><![CDATA[ |
| 1480 | 1 AND undefined = undefined, but |
| 1481 | 0 AND undefined = 0, and |
| 1482 | similarly |
| 1483 | 0 OR undefined = undefined, but |
| 1484 | 1 OR undefined = 1.]]></programlisting> |
| 1485 | |
| 1486 | <para>It turns out that gcc (quite legitimately) generates |
| 1487 | code which relies on this fact, so we have to model it |
| 1488 | properly in order to avoid flooding users with spurious value |
| 1489 | errors. The ultimate definedness result of AND and OR is |
| 1490 | calculated using <computeroutput>UifU</computeroutput> on the |
| 1491 | definedness of the arguments, but we also |
| 1492 | <computeroutput>DifD</computeroutput> in some "improvement" |
| 1493 | terms which take into account the above phenomena.</para> |
| 1494 | |
| 1495 | <para><computeroutput>ImproveAND</computeroutput> takes as |
| 1496 | its arguments the actual value of an argument to AND |
| 1497 | (the T) and the definedness of that argument (the Q), and |
| 1498 | returns a V-bit vector which is defined (0) for bits which |
| 1499 | have value 0 and are defined; this, when |
| 1500 | <computeroutput>DifD</computeroutput>ed into the final result, |
| 1501 | causes those bits to be defined even if the corresponding bit |
| 1502 | in the other argument is undefined.</para> |
| 1503 | |
| 1504 | <para>The <computeroutput>ImproveOR</computeroutput> ops do |
| 1505 | the dual thing for OR arguments. Note that XOR does not have |
| 1506 | this property that one argument can make the other |
| 1507 | irrelevant, so there is no need for such complexity for |
| 1508 | XOR.</para> |
| 1509 | </listitem> |
| 1510 | |
| 1511 | </itemizedlist> |
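
<para>The tag ops above can be modelled in a few lines of C. This is a
sketch under the convention that 1 means undefined, covering only the
32-bit variants and a couple of 8-bit ones. The function names are
invented for the illustration, and the formulas for
<computeroutput>Left</computeroutput> and for the improvement terms
are one way of getting the behaviour described, not necessarily the
code Valgrind emits.</para>
<programlisting><![CDATA[
#include <stdint.h>

/* Pessimising casts to and from a single V bit (held in bit 0). */
uint32_t pcast40(uint32_t v) { return v != 0; }
uint32_t pcast04(uint32_t v) { return (v & 1u) ? 0xFFFFFFFFu : 0u; }

/* Pessimising cast from 8 V bits to 32. */
uint32_t pcast14(uint8_t v)  { return v ? 0xFFFFFFFFu : 0u; }

/* Left: everything from the lowest undefined bit upwards becomes
   undefined, modelling worst-case carry propagation. */
uint32_t left4(uint32_t v)   { return v | (0u - v); }

/* Widening the V bits uses the same operation as widening the value. */
uint32_t zwiden14(uint8_t v) { return v; }
uint32_t swiden14(uint8_t v) { return (v & 0x80u) ? (0xFFFFFF00u | v) : v; }

/* Undefined-if-either-Undefined and Defined-if-either-Defined. */
uint32_t uifu4(uint32_t qa, uint32_t qb) { return qa | qb; }
uint32_t difd4(uint32_t qa, uint32_t qb) { return qa & qb; }

/* Improvement terms: t is the value (T), q its V bits (Q).  The AND
   term is defined (0) where t is a defined 0; the OR term is defined
   where t is a defined 1. */
uint32_t improve_and4(uint32_t t, uint32_t q) { return t | q; }
uint32_t improve_or4(uint32_t t, uint32_t q)  { return ~t | q; }

/* V bits of (a AND b): UifU of the argument V bits, with the
   improvement terms DifD-ed in. */
uint32_t and4_vbits(uint32_t a, uint32_t qa, uint32_t b, uint32_t qb)
{
    uint32_t naive   = uifu4(qa, qb);
    uint32_t improve = difd4(improve_and4(a, qa), improve_and4(b, qb));
    return difd4(naive, improve);
}

/* V bits of (a OR b), the dual case. */
uint32_t or4_vbits(uint32_t a, uint32_t qa, uint32_t b, uint32_t qb)
{
    uint32_t naive   = uifu4(qa, qb);
    uint32_t improve = difd4(improve_or4(a, qa), improve_or4(b, qb));
    return difd4(naive, improve);
}]]></programlisting>

<para>In this model
<computeroutput>and4_vbits(0, 0, b, 0xFFFFFFFF)</computeroutput> is 0
for any <computeroutput>b</computeroutput>: a fully defined zero ANDed
with a completely undefined value gives a fully defined result, which
is the "0 AND undefined = 0" rule applied to every bit.</para>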
| 1512 | |
| 1513 | <para>That's all the tag ops. If you stare at this long enough, |
| 1514 | and then run Valgrind and stare at the pre- and post-instrumented |
| 1515 | ucode, it should be fairly obvious how the instrumentation |
| 1516 | machinery hangs together.</para> |
| 1517 | |
| 1518 | <para>One point, if you do this: in order to make it easy to |
| 1519 | differentiate <computeroutput>TempReg</computeroutput>s carrying |
| 1520 | values from <computeroutput>TempReg</computeroutput>s carrying V |
| 1521 | bit vectors, Valgrind prints the former as (for example) |
| 1522 | <computeroutput>t28</computeroutput> and the latter as |
| 1523 | <computeroutput>q28</computeroutput>; the fact that they carry |
| 1524 | the same number serves to indicate their relationship. This is |
| 1525 | purely for the convenience of the human reader; the register |
| 1526 | allocator and code generator don't regard them as |
| 1527 | different.</para> |
| 1528 | |
| 1529 | </sect2> |
| 1530 | |
| 1531 | |
| 1532 | |
de | ccde45e | 2005-06-12 10:23:23 +0000 | [diff] [blame] | 1533 | <sect2 id="mc-tech-docs.trans" xreflabel="Translation into UCode"> |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 1534 | <title>Translation into UCode</title> |
| 1535 | |
| 1536 | <para><computeroutput>VG_(disBB)</computeroutput> allocates a new |
| 1537 | <computeroutput>UCodeBlock</computeroutput> and then uses |
| 1538 | <computeroutput>disInstr</computeroutput> to translate x86 |
| 1539 | instructions one at a time into UCode, dumping the result in the |
| 1540 | <computeroutput>UCodeBlock</computeroutput>. This goes on until |
| 1541 | a control-flow transfer instruction is encountered.</para> |
| 1542 | |
| 1543 | <para>Despite the large size of |
| 1544 | <filename>vg_to_ucode.c</filename>, this translation is really |
| 1545 | very simple. Each x86 instruction is translated entirely |
| 1546 | independently of its neighbours, merrily allocating new |
| 1547 | <computeroutput>TempReg</computeroutput>s as it goes. The idea |
| 1548 | is to have a simple translator -- in reality, no more than a |
| 1549 | macro-expander -- and the resulting bad UCode translation is |
| 1550 | cleaned up by the UCode optimisation phase which follows. To |
| 1551 | give you an idea of some x86 instructions and their translations |
| 1552 | (this is a complete basic block, as Valgrind sees it):</para> |
| 1553 | <programlisting><![CDATA[ |
| 1554 | 0x40435A50: incl %edx |
| 1555 | 0: GETL %EDX, t0 |
| 1556 | 1: INCL t0 (-wOSZAP) |
| 1557 | 2: PUTL t0, %EDX |
| 1558 | |
| 1559 | 0x40435A51: movsbl (%edx),%eax |
| 1560 | 3: GETL %EDX, t2 |
| 1561 | 4: LDB (t2), t2 |
| 1562 | 5: WIDENL_Bs t2 |
| 1563 | 6: PUTL t2, %EAX |
| 1564 | |
| 1565 | 0x40435A54: testb $0x20, 1(%ecx,%eax,2) |
| 1566 | 7: GETL %EAX, t6 |
| 1567 | 8: GETL %ECX, t8 |
| 1568 | 9: LEA2L 1(t8,t6,2), t4 |
| 1569 | 10: LDB (t4), t10 |
| 1570 | 11: MOVB $0x20, t12 |
| 1571 | 12: ANDB t12, t10 (-wOSZACP) |
| 1572 | 13: INCEIPo $9 |
| 1573 | |
| 1574 | 0x40435A59: jnz-8 0x40435A50 |
| 1575 | 14: Jnzo $0x40435A50 (-rOSZACP) |
| 1576 | 15: JMPo $0x40435A5B]]></programlisting> |
| 1577 | |
| 1578 | <para>Notice how the block always ends with an unconditional jump |
| 1579 | to the next block. This is a bit unnecessary, but makes many |
| 1580 | things simpler.</para> |
| 1581 | |
| 1582 | <para>Most x86 instructions turn into sequences of |
| 1583 | <computeroutput>GET</computeroutput>, |
| 1584 | <computeroutput>PUT</computeroutput>, |
| 1585 | <computeroutput>LEA1</computeroutput>, |
| 1586 | <computeroutput>LEA2</computeroutput>, |
| 1587 | <computeroutput>LOAD</computeroutput> and |
| 1588 | <computeroutput>STORE</computeroutput>. Some complicated ones |
| 1589 | however rely on calling helper bits of code in |
| 1590 | <filename>vg_helpers.S</filename>. The ucode instructions |
| 1591 | <computeroutput>PUSH</computeroutput>, |
| 1592 | <computeroutput>POP</computeroutput>, |
| 1593 | <computeroutput>CALLM</computeroutput>, |
| 1594 | <computeroutput>CALLM_S</computeroutput> and |
| 1595 | <computeroutput>CALLM_E</computeroutput> support this. The |
| 1596 | calling convention is somewhat ad-hoc and is not the C calling |
| 1597 | convention. The helper routines must save all integer registers, |
| 1598 | and the flags, that they use. Args are passed on the stack |
| 1599 | underneath the return address, as usual, and if result(s) are to |
| 1600 | be returned, it (they) are either placed in dummy arg slots |
| 1601 | created by the ucode <computeroutput>PUSH</computeroutput> |
| 1602 | sequence, or just overwrite the incoming args.</para> |
| 1603 | |
| 1604 | <para>In order that the instrumentation mechanism can handle |
| 1605 | calls to these helpers, |
| 1606 | <computeroutput>VG_(saneUCodeBlock)</computeroutput> enforces the |
| 1607 | following restrictions on calls to helpers:</para> |
| 1608 | |
| 1609 | <itemizedlist> |
| 1610 | |
| 1611 | <listitem> |
| 1612 | <para>Each <computeroutput>CALLM</computeroutput> uinstr must |
| 1613 | be bracketed by a preceding |
| 1614 | <computeroutput>CALLM_S</computeroutput> marker (dummy |
| 1615 | uinstr) and a trailing |
| 1616 | <computeroutput>CALLM_E</computeroutput> marker. These |
| 1617 | markers are used by the instrumentation mechanism later to |
| 1618 | establish the boundaries of the |
| 1619 | <computeroutput>PUSH</computeroutput>, |
| 1620 | <computeroutput>POP</computeroutput> and |
| 1621 | <computeroutput>CLEAR</computeroutput> sequences for the |
| 1622 | call.</para> |
| 1623 | </listitem> |
| 1624 | |
| 1625 | <listitem> |
| 1626 | <para><computeroutput>PUSH</computeroutput>, |
| 1627 | <computeroutput>POP</computeroutput> and |
| 1628 | <computeroutput>CLEAR</computeroutput> may only appear inside |
| 1629 | sections bracketed by |
| 1630 | <computeroutput>CALLM_S</computeroutput> and |
| 1631 | <computeroutput>CALLM_E</computeroutput>, and nowhere else.</para> |
| 1632 | </listitem> |
| 1633 | |
| 1634 | <listitem> |
| 1635 | <para>In any such bracketed section, no two |
| 1636 | <computeroutput>PUSH</computeroutput> insns may push the same |
| 1637 | <computeroutput>TempReg</computeroutput>. Dually, no two |
| 1638 | <computeroutput>POP</computeroutput>s may pop the same |
| 1639 | <computeroutput>TempReg</computeroutput>.</para> |
| 1640 | </listitem> |
| 1641 | |
| 1642 | <listitem> |
| 1643 | <para>Finally, although this is not checked, args should be |
| 1644 | removed from the stack with |
| 1645 | <computeroutput>CLEAR</computeroutput>, rather than |
| 1646 | <computeroutput>POP</computeroutput>s into a |
| 1647 | <computeroutput>TempReg</computeroutput> which is not |
| 1648 | subsequently used. This is because the instrumentation |
| 1649 | mechanism assumes that all values |
| 1650 | <computeroutput>POP</computeroutput>ped from the stack are |
| 1651 | actually used.</para> |
| 1652 | </listitem> |
| 1653 | |
| 1654 | </itemizedlist> |
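
<para>The checked restrictions can be pictured as a single pass over
the block. The following C sketch is not
<computeroutput>VG_(saneUCodeBlock)</computeroutput>; the types, the
opcode names and the bound on <computeroutput>TempReg</computeroutput>
numbers are hypothetical, and it additionally assumes that bracketed
sections do not nest.</para>
<programlisting><![CDATA[
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified uinstr: an opcode and one TempReg number. */
typedef enum {
    OP_OTHER, OP_CALLM_S, OP_CALLM_E, OP_CALLM, OP_PUSH, OP_POP, OP_CLEAR
} SketchOp;

typedef struct { SketchOp op; int tempreg; } SketchCallInstr;

enum { MAX_TEMPS = 64 };   /* assumed bound, for this sketch only */

bool helper_calls_are_sane(const SketchCallInstr *code, size_t n)
{
    bool in_section = false;
    bool pushed[MAX_TEMPS] = { false };
    bool popped[MAX_TEMPS] = { false };

    for (size_t i = 0; i < n; i++) {
        const SketchCallInstr *u = &code[i];
        switch (u->op) {
        case OP_CALLM_S:
            if (in_section) return false;      /* assumed: no nesting */
            in_section = true;
            for (int t = 0; t < MAX_TEMPS; t++)
                pushed[t] = popped[t] = false;
            break;
        case OP_CALLM_E:
            if (!in_section) return false;     /* unmatched end marker */
            in_section = false;
            break;
        case OP_CALLM:
        case OP_CLEAR:
            if (!in_section) return false;     /* only inside a section */
            break;
        case OP_PUSH:
        case OP_POP: {
            if (!in_section) return false;
            if (u->tempreg < 0 || u->tempreg >= MAX_TEMPS) return false;
            bool *seen = (u->op == OP_PUSH) ? pushed : popped;
            if (seen[u->tempreg]) return false; /* same TempReg twice */
            seen[u->tempreg] = true;
            break;
        }
        case OP_OTHER:
            break;
        }
    }
    return !in_section;    /* every CALLM_S needs its CALLM_E */
}]]></programlisting>

<para>The last restriction in the list (prefer
<computeroutput>CLEAR</computeroutput> over a
<computeroutput>POP</computeroutput> into an unused
<computeroutput>TempReg</computeroutput>) is left unchecked here too,
matching the text.</para>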
| 1655 | |
| 1656 | <para>Some of the translations may appear to have redundant |
| 1657 | <computeroutput>TempReg</computeroutput>-to-<computeroutput>TempReg</computeroutput> |
| 1658 | moves. This helps the next phase, UCode optimisation, to |
| 1659 | generate better code.</para> |
| 1660 | |
| 1661 | </sect2> |
| 1662 | |
| 1663 | |
| 1664 | |
| 1665 | <sect2 id="mc-tech-docs.optim" xreflabel="UCode optimisation"> |
| 1666 | <title>UCode optimisation</title> |
| 1667 | |
| 1668 | <para>UCode is then subjected to an improvement pass |
| 1669 | (<computeroutput>vg_improve()</computeroutput>), which blurs the |
| 1670 | boundaries between the translations of the original x86 |
| 1671 | instructions. It's pretty straightforward. Three |
| 1672 | transformations are done:</para> |
| 1673 | |
| 1674 | <itemizedlist> |
| 1675 | |
| 1676 | <listitem> |
| 1677 | <para>Redundant <computeroutput>GET</computeroutput> |
| 1678 | elimination. Actually, more general than that -- eliminates |
| 1679 | redundant fetches of ArchRegs. In our running example, |
| 1680 | uinstr 3 <computeroutput>GET</computeroutput>s |
| 1681 | <computeroutput>%EDX</computeroutput> into |
| 1682 | <computeroutput>t2</computeroutput> despite the fact that, by |
| 1683 | looking at the previous uinstr, it is already in |
| 1684 | <computeroutput>t0</computeroutput>. The |
| 1685 | <computeroutput>GET</computeroutput> is therefore removed, |
| 1686 | and <computeroutput>t2</computeroutput> renamed to |
| 1687 | <computeroutput>t0</computeroutput>. Assuming |
| 1688 | <computeroutput>t0</computeroutput> is allocated to a host |
| 1689 | register, it means the simulated |
| 1690 | <computeroutput>%EDX</computeroutput> will exist in a host |
| 1691 | CPU register for more than one simulated x86 instruction, |
| 1692 | which seems to me to be a highly desirable property.</para> |
| 1693 | |
| 1694 | <para>There is some mucking around to do with subregisters; |
| 1695 | <computeroutput>%AL</computeroutput> vs |
| 1696 | <computeroutput>%AH</computeroutput> vs |
| 1697 | <computeroutput>%AX</computeroutput> vs |
| 1698 | <computeroutput>%EAX</computeroutput> etc. I can't remember |
| 1699 | how it works, but in general we are very conservative, and |
| 1700 | these tend to invalidate the caching.</para> |
| 1701 | </listitem> |
| 1702 | |
| 1703 | <listitem> |
| 1704 | <para>Redundant <computeroutput>PUT</computeroutput> |
| 1705 | elimination. This annuls |
| 1706 | <computeroutput>PUT</computeroutput>s of values back to |
| 1707 | simulated CPU registers if a later |
| 1708 | <computeroutput>PUT</computeroutput> would overwrite the |
| 1709 | earlier <computeroutput>PUT</computeroutput> value, and there |
| 1710 | are no intervening reads of the simulated register |
| 1711 | (<computeroutput>ArchReg</computeroutput>).</para> |
| 1712 | |
| 1713 | <para>As before, we are paranoid when faced with subregister |
| 1714 | references. Also, <computeroutput>PUT</computeroutput>s of |
| 1715 | <computeroutput>%ESP</computeroutput> are never annulled, |
| 1716 | because it is vital the instrumenter always has an up-to-date |
| 1717 | <computeroutput>%ESP</computeroutput> value available, since |
| 1718 | <computeroutput>%ESP</computeroutput> changes affect |
sewardj | 3387889 | 2007-11-17 09:43:25 +0000 | [diff] [blame^] | 1719 | addressability of the memory around the simulated stack |
njn | 3e986b2 | 2004-11-30 10:43:45 +0000 | [diff] [blame] | 1720 | pointer.</para> |
| 1721 | |
| 1722 | <para>The implication of the above paragraph is that the |
| 1723 | simulated machine's registers are only lazily updated once |
| 1724 | the above two optimisation phases have run, with the |
| 1725 | exception of <computeroutput>%ESP</computeroutput>. |
| 1726 | <computeroutput>TempReg</computeroutput>s go dead at the end |
| 1727 | of every basic block, from which it is inferable that any |
| 1728 | <computeroutput>TempReg</computeroutput> caching a simulated |
| 1729 | CPU reg is flushed (back into the relevant |
| 1730 | <computeroutput>VG_(baseBlock)</computeroutput> slot) at the |
| 1731 | end of every basic block. The further implication is that |
| 1732 | the simulated registers are only up-to-date in between |
| 1733 | basic blocks, and not at arbitrary points inside basic |
| 1734 | blocks. And the consequence of that is that we can only |
| 1735 | deliver signals to the client in between basic blocks. None |
| 1736 | of this seems any problem in practice.</para> |
| 1737 | </listitem> |
| 1738 | |
| 1739 | <listitem> |
| 1740 | <para>Finally there is a simple def-use thing for condition |
| 1741 | codes. If an earlier uinstr writes the condition codes, and |
| 1742 | the next uinstr along which actually cares about the condition |
| 1743 | codes writes the same or larger set of them, but does not |
| 1744 | read any, the earlier uinstr is marked as not writing any |
| 1745 | condition codes. This saves a lot of redundant cond-code |
| 1746 | saving and restoring.</para> |
| 1747 | </listitem> |
| 1748 | |
| 1749 | </itemizedlist> |
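
<para>The condition-code transformation is simple enough to sketch in
C. This is an illustration of the rule as stated, not the code in
<computeroutput>vg_improve()</computeroutput>: the
<computeroutput>FlagUse</computeroutput> type and the flag constants
are hypothetical, and the sketch assumes that flags written by the
last flag-writing uinstr in a block must be kept because they may be
needed after the block.</para>
<programlisting><![CDATA[
#include <stddef.h>

enum { FLAG_O = 1, FLAG_S = 2, FLAG_Z = 4,
       FLAG_A = 8, FLAG_C = 16, FLAG_P = 32 };

/* Hypothetical per-uinstr record of condition codes read and written. */
typedef struct { unsigned flags_r; unsigned flags_w; } FlagUse;

/* Annul flag writes that are overwritten before anyone reads them.
   Returns the number of uinstrs whose flag writes were annulled. */
size_t annul_dead_flag_writes(FlagUse *code, size_t n)
{
    size_t annulled = 0;

    /* Scan backwards so that an annulled uinstr no longer counts as
       "caring" about the flags when earlier uinstrs are examined. */
    for (size_t i = n; i-- > 0; ) {
        if (code[i].flags_w == 0)
            continue;

        /* Find the next uinstr that reads or writes any flag. */
        size_t j = i + 1;
        while (j < n && code[j].flags_r == 0 && code[j].flags_w == 0)
            j++;
        if (j == n)
            continue;   /* nothing later cares: keep the write */

        /* It must read nothing and write at least the same flags. */
        if (code[j].flags_r == 0
            && (code[i].flags_w & ~code[j].flags_w) == 0) {
            code[i].flags_w = 0;
            annulled++;
        }
    }
    return annulled;
}]]></programlisting>

<para>Applied to the running example, the
<computeroutput>OSZAP</computeroutput> write at uinstr 1 is annulled
because uinstr 12 writes <computeroutput>OSZACP</computeroutput>
without reading anything, while uinstr 12 keeps its write because
uinstr 14 reads the flags.</para>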
| 1750 | |
| 1751 | <para>The effect of these transformations on our short block is |
| 1752 | rather unexciting, and shown below. On longer basic blocks they |
| 1753 | can dramatically improve code quality.</para> |
| 1754 | |
| 1755 | <programlisting><![CDATA[ |
| 1756 | at 3: delete GET, rename t2 to t0 in (4 .. 6) |
| 1757 | at 7: delete GET, rename t6 to t0 in (8 .. 9) |
| 1758 | at 1: annul flag write OSZAP due to later OSZACP |
| 1759 | |
| 1760 | Improved code: |
| 1761 | 0: GETL %EDX, t0 |
| 1762 | 1: INCL t0 |
| 1763 | 2: PUTL t0, %EDX |
| 1764 | 4: LDB (t0), t0 |
| 1765 | 5: WIDENL_Bs t0 |
| 1766 | 6: PUTL t0, %EAX |
| 1767 | 8: GETL %ECX, t8 |
| 1768 | 9: LEA2L 1(t8,t0,2), t4 |
| 1769 | 10: LDB (t4), t10 |
| 1770 | 11: MOVB $0x20, t12 |
| 1771 | 12: ANDB t12, t10 (-wOSZACP) |
| 1772 | 13: INCEIPo $9 |
| 1773 | 14: Jnzo $0x40435A50 (-rOSZACP) |
| 1774 | 15: JMPo $0x40435A5B]]></programlisting> |
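
<para>For readers who prefer code, here is a minimal C sketch of
redundant <computeroutput>GET</computeroutput> elimination that
reproduces the two deletions in the listing above. It is a toy model,
not the real pass: the <computeroutput>SketchUInstr</computeroutput>
type and all function names are hypothetical, every uinstr is assumed
to read at most two <computeroutput>TempReg</computeroutput>s and
write at most one, and subregister accesses are ignored
entirely.</para>
<programlisting><![CDATA[
#include <stdbool.h>
#include <stddef.h>

enum { N_ARCHREGS = 8, NO_REG = -1 };

typedef enum { K_GET, K_PUT, K_OTHER } SketchKind;

/* Hypothetical, simplified uinstr. */
typedef struct {
    SketchKind kind;
    int  archreg;      /* ArchReg for GET/PUT, else NO_REG */
    int  src1, src2;   /* TempRegs read, or NO_REG */
    int  dst;          /* TempReg written, or NO_REG */
    bool deleted;
} SketchUInstr;

/* Rename TempReg old_t to new_t in uinstrs from..n-1. */
static void rename_temp(SketchUInstr *code, size_t from, size_t n,
                        int old_t, int new_t)
{
    for (size_t i = from; i < n; i++) {
        if (code[i].src1 == old_t) code[i].src1 = new_t;
        if (code[i].src2 == old_t) code[i].src2 = new_t;
        if (code[i].dst  == old_t) code[i].dst  = new_t;
    }
}

/* TempReg t has been overwritten: it no longer caches any ArchReg. */
static void forget_temp(int *cache, int t)
{
    for (int r = 0; r < N_ARCHREGS; r++)
        if (cache[r] == t) cache[r] = NO_REG;
}

/* Returns the number of GETs marked as deleted. */
size_t eliminate_redundant_gets(SketchUInstr *code, size_t n)
{
    int cache[N_ARCHREGS];   /* cache[r] = TempReg holding ArchReg r */
    size_t removed = 0;

    for (int r = 0; r < N_ARCHREGS; r++)
        cache[r] = NO_REG;

    for (size_t i = 0; i < n; i++) {
        SketchUInstr *u = &code[i];
        if (u->kind == K_GET && cache[u->archreg] != NO_REG) {
            /* Value is already in a TempReg: drop the GET, reuse it. */
            u->deleted = true;
            rename_temp(code, i + 1, n, u->dst, cache[u->archreg]);
            removed++;
        } else if (u->kind == K_GET) {
            forget_temp(cache, u->dst);
            cache[u->archreg] = u->dst;
        } else if (u->kind == K_PUT) {
            cache[u->archreg] = u->src1;   /* src1 now mirrors ArchReg */
        } else if (u->dst != NO_REG) {
            forget_temp(cache, u->dst);
        }
    }
    return removed;
}]]></programlisting>

<para>Fed the first ten uinstrs of the running example, this model
deletes the <computeroutput>GET</computeroutput>s at 3 and 7 and
renames <computeroutput>t2</computeroutput> and
<computeroutput>t6</computeroutput> to
<computeroutput>t0</computeroutput>, as in the log above.</para>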
| 1775 | |
| 1776 | </sect2> |
| 1777 | |
| 1778 | |
| 1779 | |
| 1780 | <sect2 id="mc-tech-docs.instrum" xreflabel="UCode instrumentation"> |
| 1781 | <title>UCode instrumentation</title> |
| 1782 | |
| 1783 | <para>Once you understand the meaning of the instrumentation |
| 1784 | uinstrs, discussed in detail above, the instrumentation scheme is |
| 1785 | fairly straightforward. Each uinstr is instrumented in |
| 1786 | isolation, and the instrumentation uinstrs are placed before the |
| 1787 | original uinstr. Our running example continues below. I have |
| 1788 | placed a blank line after every original uinstr, to make it easier |
| 1789 | to see which instrumentation uinstrs correspond to which |
| 1790 | originals.</para> |
| 1791 | |
| 1792 | <para>As mentioned somewhere above, |
| 1793 | <computeroutput>TempReg</computeroutput>s carrying values have |
| 1794 | names like <computeroutput>t28</computeroutput>, and each one has |
| 1795 | a shadow carrying its V bits, with names like |
| 1796 | <computeroutput>q28</computeroutput>. This pairing aids in |
| 1797 | reading instrumented ucode.</para> |
| 1798 | |
| 1799 | <para>One decision about all this is where to have "observation |
| 1800 | points", that is, where to check that V bits are valid. I use a |
| 1801 | minimalistic scheme, only checking where a failure of validity |
| 1802 | could cause the original program to (seg)fault. So the use of |
| 1803 | values as memory addresses causes a check, as do conditional |
| 1804 | jumps (these cause a check on the definedness of the condition |
| 1805 | codes). And arguments <computeroutput>PUSH</computeroutput>ed |
| 1806 | for helper calls are checked, hence the weird restrictions on |
| 1807 | helper call preambles described above.</para> |
| 1808 | |
| 1809 | <para>Another decision is that once a value is tested, it is |
| 1810 | thereafter regarded as defined, so that we do not emit multiple |
| 1811 | undefined-value errors for the same undefined value. That means |
| 1812 | that <computeroutput>TESTV</computeroutput> uinstrs are always |
| 1813 | followed by <computeroutput>SETV</computeroutput> on the same |
| 1814 | (shadow) <computeroutput>TempReg</computeroutput>s. Most of |
| 1815 | these <computeroutput>SETV</computeroutput>s are redundant and |
| 1816 | are removed by the post-instrumentation cleanup phase.</para> |
| 1817 | |
| 1818 | <para>The instrumentation for calling helper functions deserves |
| 1819 | further comment. The definedness of results from a helper is |
| 1820 | modelled using just one V bit. So, in short, we do pessimising |
| 1821 | casts of the definedness of all the args, down to a single bit, |
| 1822 | and then <computeroutput>UifU</computeroutput> these bits |
| 1823 | together. So this single V bit will say "undefined" if any part |
| 1824 | of any arg is undefined. This V bit is then pessimally cast back |
| 1825 | up to the result(s) sizes, as needed. If, by seeing that all the |
| 1826 | args are got rid of with <computeroutput>CLEAR</computeroutput> |
| 1827 | and none with <computeroutput>POP</computeroutput>, Valgrind sees |
| 1828 | that the result of the call is not actually used, it immediately |
| 1829 | examines the result V bit with a |
| 1830 | <computeroutput>TESTV</computeroutput> -- |
| 1831 | <computeroutput>SETV</computeroutput> pair. If it did not do |
| 1832 | this, there would be no observation point to detect that some |
| 1833 | of the args to the helper were undefined. Of course, if the |
| 1834 | helper's results are indeed used, we don't do this, since the |
| 1835 | result usage will presumably cause the result definedness to be |
| 1836 | checked at some suitable future point.</para> |
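
<para>To make the helper-call rule concrete, here is a minimal C sketch
of the pessimise-then-<computeroutput>UifU</computeroutput> scheme. It is
an illustration only, not code from the Valgrind sources: the function
names are invented for this example, V bits are held in a plain
<computeroutput>uint32_t</computeroutput>, and it assumes the encoding in
which a V bit of 1 means "undefined".</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed encoding: V bit 1 = undefined, V bit 0 = defined. */

/* Pessimising cast down to one bit: undefined if any bit is. */
static uint32_t pcast_to_1(uint32_t vbits)
{
    return vbits != 0 ? 1u : 0u;
}

/* UifU: undefined if either operand is undefined. */
static uint32_t uifu(uint32_t a, uint32_t b)
{
    return a | b;
}

/* Pessimising cast of one bit back up to a 1, 2 or 4 byte result. */
static uint32_t pcast_from_1(uint32_t bit, unsigned size_bytes)
{
    assert(size_bytes == 1 || size_bytes == 2 || size_bytes == 4);
    if (bit == 0)
        return 0;
    return size_bytes == 4 ? 0xFFFFFFFFu : (1u << (8 * size_bytes)) - 1u;
}

/* V bits of a helper's result, given the V bits of each argument. */
static uint32_t helper_result_vbits(const uint32_t *arg_vbits, size_t nargs,
                                    unsigned result_size_bytes)
{
    uint32_t summary = 0;   /* the single V bit modelling the call */
    for (size_t i = 0; i < nargs; i++)
        summary = uifu(summary, pcast_to_1(arg_vbits[i]));
    return pcast_from_1(summary, result_size_bytes);
}]]></programlisting>

<para>With every arg fully defined the result V bits are all zero; a
single undefined bit in any arg marks every bit of the result as
undefined.</para>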
| 1837 | |
| 1838 | <para>In general Valgrind tries to track definedness on a |
| 1839 | bit-for-bit basis, but as the above para shows, for calls to |
| 1840 | helpers we throw in the towel and approximate down to a single |
| 1841 | bit. This is because it's too complex and difficult to track |
| 1842 | bit-level definedness through complex ops such as integer |
| 1843 | multiply and divide, and in any case there are no reasonable code |
| 1844 | fragments which attempt to (eg) multiply two partially-defined |
| 1845 | values and end up with something meaningful, so there seems |
| 1846 | little point in modelling multiplies, divides, etc, in that level |
| 1847 | of detail.</para> |
| 1848 | |
| 1849 | <para>Integer loads and stores are instrumented with firstly a |
| 1850 | test of the definedness of the address, followed by a |
| 1851 | <computeroutput>LOADV</computeroutput> or |
| 1852 | <computeroutput>STOREV</computeroutput> respectively. These turn |
| 1853 | into calls to (for example) |
| 1854 | <computeroutput>VG_(helperc_LOADV4)</computeroutput>. These |
| 1855 | helpers do two things: they perform an address-valid check, and |
| 1856 | they load or store V bits from/to the relevant address in the |
| 1857 | (simulated V-bit) memory.</para> |
| 1858 | |
| 1859 | <para>FPU loads and stores are different. As above the |
| 1860 | definedness of the address is first tested. However, the helper |
| 1861 | routine for FPU loads |
| 1862 | (<computeroutput>VGM_(fpu_read_check)</computeroutput>) emits an |
| 1863 | error if either the address is invalid or the referenced area |
| 1864 | contains undefined values. It has to do this because we do not |
| 1865 | simulate the FPU at all, and so cannot track definedness of |
| 1866 | values loaded into it from memory, so we have to check them as |
| 1867 | soon as they are loaded into the FPU, ie, at this point. We |
| 1868 | notionally assume that everything in the FPU is defined.</para> |
| 1869 | |
| 1870 | <para>It follows therefore that FPU writes first check the |
| 1871 | definedness of the address, then the validity of the address, and |
| 1872 | finally mark the written bytes as well-defined.</para> |
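
<para>The difference between the integer and FPU cases can be sketched
with a toy shadow memory. Everything here is hypothetical (the
<computeroutput>toy_*</computeroutput> names, the 64-byte memory, one V
byte per data byte with 0xFF meaning "undefined"); the real helpers are
considerably more involved. The point is only that the integer load
hands V bits back for further tracking, whereas the FPU check must
complain immediately.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MEM_SIZE 64u

static bool     toy_abit[TOY_MEM_SIZE];   /* true = addressable       */
static uint8_t  toy_vbyte[TOY_MEM_SIZE];  /* 0x00 defined, 0xFF undef */
static unsigned toy_errors;               /* errors "reported" so far */

static bool toy_addr_ok(size_t addr, size_t len)
{
    if (addr > TOY_MEM_SIZE || len > TOY_MEM_SIZE - addr)
        return false;
    for (size_t i = 0; i < len; i++)
        if (!toy_abit[addr + i])
            return false;
    return true;
}

/* Like a fresh allocation: addressable, but contents undefined. */
static void toy_make_writable(size_t addr, size_t len)
{
    assert(addr <= TOY_MEM_SIZE && len <= TOY_MEM_SIZE - addr);
    for (size_t i = 0; i < len; i++) {
        toy_abit[addr + i] = true;
        toy_vbyte[addr + i] = 0xFF;
    }
}

/* Integer load: address check, then return V bits for later tracking.
   On a bad address this sketch returns "defined" to avoid piling
   further errors on top of the first one. */
static uint32_t toy_loadv4(size_t addr)
{
    uint32_t v = 0;
    if (!toy_addr_ok(addr, 4)) { toy_errors++; return 0; }
    for (size_t i = 0; i < 4; i++)
        v |= (uint32_t)toy_vbyte[addr + i] << (8 * i);
    return v;
}

/* FPU load: nothing tracks the value afterwards, so check it now. */
static void toy_fpu_read_check(size_t addr, size_t len)
{
    if (!toy_addr_ok(addr, len)) { toy_errors++; return; }
    for (size_t i = 0; i < len; i++)
        if (toy_vbyte[addr + i] != 0) { toy_errors++; return; }
}

/* FPU store: address check, then mark the bytes well-defined. */
static void toy_fpu_write_check(size_t addr, size_t len)
{
    if (!toy_addr_ok(addr, len)) { toy_errors++; return; }
    for (size_t i = 0; i < len; i++)
        toy_vbyte[addr + i] = 0;
}]]></programlisting>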
| 1873 | |
| 1874 | <para>If anyone is inspired to extend Valgrind to MMX/SSE insns, |
| 1875 | I suggest you use the same trick. It works provided that the |
| 1876 | FPU/MMX unit is not used merely as a conduit to copy partially |
| 1877 | undefined data from one place in memory to another. |
| 1878 | Unfortunately the integer CPU is used like that (when copying C |
| 1879 | structs with holes, for example) and this is the cause of much of |
| 1880 | the elaborateness of the instrumentation here described.</para> |
| 1881 | |
| 1882 | <para><computeroutput>vg_instrument()</computeroutput> in |
| 1883 | <filename>vg_translate.c</filename> actually does the |
| 1884 | instrumentation. There are comments explaining how each uinstr |
| 1885 | is handled, so we do not repeat that here. As explained already, |
| 1886 | it is bit-accurate, except for calls to helper functions. |
| 1887 | Unfortunately the x86 insns |
| 1888 | <computeroutput>bt/bts/btc/btr</computeroutput> are done by |
| 1889 | helper fns, so bit-level accuracy is lost there. This should be |
| 1890 | fixed by doing them inline; it will probably require adding a |
| 1891 | couple of new uinstrs. Also, left and right rotates through the |
| 1892 | carry flag (x86 <computeroutput>rcl</computeroutput> and |
| 1893 | <computeroutput>rcr</computeroutput>) are approximated via a |
| 1894 | single V bit; so far this has not caused anyone to complain. The |
| 1895 | non-carry rotates, <computeroutput>rol</computeroutput> and |
| 1896 | <computeroutput>ror</computeroutput>, are much more common and |
| 1897 | are done exactly. Re-visiting the instrumentation for AND and |
| 1898 | OR, they seem rather verbose, and I wonder if it could be done |
| 1899 | more concisely now.</para> |
| 1900 | |
| 1901 | <para>The lowercase <computeroutput>o</computeroutput> on many of |
| 1902 | the uopcodes in the running example indicates that the size field |
| 1903 | is zero, usually meaning a single-bit operation.</para> |
| 1904 | |
| 1905 | <para>Anyroads, the post-instrumented version of our running |
| 1906 | example looks like this:</para> |
| 1907 | |
| 1908 | <programlisting><![CDATA[ |
| 1909 | Instrumented code: |
| 1910 | 0: GETVL %EDX, q0 |
| 1911 | 1: GETL %EDX, t0 |
| 1912 | |
| 1913 | 2: TAG1o q0 = Left4 ( q0 ) |
| 1914 | 3: INCL t0 |
| 1915 | |
| 1916 | 4: PUTVL q0, %EDX |
| 1917 | 5: PUTL t0, %EDX |
| 1918 | |
| 1919 | 6: TESTVL q0 |
| 1920 | 7: SETVL q0 |
| 1921 | 8: LOADVB (t0), q0 |
| 1922 | 9: LDB (t0), t0 |
| 1923 | |
| 1924 | 10: TAG1o q0 = SWiden14 ( q0 ) |
| 1925 | 11: WIDENL_Bs t0 |
| 1926 | |
| 1927 | 12: PUTVL q0, %EAX |
| 1928 | 13: PUTL t0, %EAX |
| 1929 | |
| 1930 | 14: GETVL %ECX, q8 |
| 1931 | 15: GETL %ECX, t8 |
| 1932 | |
| 1933 | 16: MOVL q0, q4 |
| 1934 | 17: SHLL $0x1, q4 |
| 1935 | 18: TAG2o q4 = UifU4 ( q8, q4 ) |
| 1936 | 19: TAG1o q4 = Left4 ( q4 ) |
| 1937 | 20: LEA2L 1(t8,t0,2), t4 |
| 1938 | |
| 1939 | 21: TESTVL q4 |
| 1940 | 22: SETVL q4 |
| 1941 | 23: LOADVB (t4), q10 |
| 1942 | 24: LDB (t4), t10 |
| 1943 | |
| 1944 | 25: SETVB q12 |
| 1945 | 26: MOVB $0x20, t12 |
| 1946 | |
| 1947 | 27: MOVL q10, q14 |
| 1948 | 28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 ) |
| 1949 | 29: TAG2o q10 = UifU1 ( q12, q10 ) |
| 1950 | 30: TAG2o q10 = DifD1 ( q14, q10 ) |
| 1951 | 31: MOVL q12, q14 |
| 1952 | 32: TAG2o q14 = ImproveAND1_TQ ( t12, q14 ) |
| 1953 | 33: TAG2o q10 = DifD1 ( q14, q10 ) |
| 1954 | 34: MOVL q10, q16 |
| 1955 | 35: TAG1o q16 = PCast10 ( q16 ) |
| 1956 | 36: PUTVFo q16 |
| 1957 | 37: ANDB t12, t10 (-wOSZACP) |
| 1958 | |
| 1959 | 38: INCEIPo $9 |
| 1960 | |
| 1961 | 39: GETVFo q18 |
| 1962 | 40: TESTVo q18 |
| 1963 | 41: SETVo q18 |
| 1964 | 42: Jnzo $0x40435A50 (-rOSZACP) |
| 1965 | |
| 1966 | 43: JMPo $0x40435A5B]]></programlisting> |
| 1967 | |
| 1968 | </sect2> |
| 1969 | |
| 1970 | |
| 1971 | |
| 1972 | <sect2 id="mc-tech-docs.cleanup" |
| 1973 | xreflabel="UCode post-instrumentation cleanup"> |
| 1974 | <title>UCode post-instrumentation cleanup</title> |
| 1975 | |
| 1976 | <para>This pass, coordinated by |
| 1977 | <computeroutput>vg_cleanup()</computeroutput>, removes redundant |
| 1978 | definedness computation created by the simplistic instrumentation |
| 1979 | pass. It consists of two passes, |
| 1980 | <computeroutput>vg_propagate_definedness()</computeroutput> |
| 1981 | followed by |
| 1982 | <computeroutput>vg_delete_redundant_SETVs()</computeroutput>.</para> |
| 1983 | |
| 1984 | <para><computeroutput>vg_propagate_definedness()</computeroutput> |
| 1985 | is a simple constant-propagation and constant-folding pass. It |
| 1986 | tries to determine which |
| 1987 | <computeroutput>TempReg</computeroutput>s containing V bits will |
| 1988 | always indicate "fully defined", and it propagates this |
| 1989 | information as far as it can, and folds out as many operations as |
| 1990 | possible. For example, the instrumentation for an ADD of a |
| 1991 | literal to a variable quantity will be reduced down so that the |
| 1992 | definedness of the result is simply the definedness of the |
| 1993 | variable quantity, since the literal is by definition fully |
| 1994 | defined.</para> |
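
<para>The flavour of this folding can be shown on a toy instruction
set. The sketch below is an assumption-laden illustration, not the real
pass: the <computeroutput>toy_*</computeroutput> names and the
four-opcode instruction set are invented, and only
<computeroutput>UifU</computeroutput> is folded. A
<computeroutput>UifU</computeroutput> whose source is known to be fully
defined is deleted; one whose destination is known to be fully defined
degenerates into a move.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NTEMPS 8

enum toy_op { TOY_NOP, TOY_GETV, TOY_SETV, TOY_MOV, TOY_UIFU };

struct toy_uinstr {
    enum toy_op op;
    int src;   /* shadow temp read; ignored for GETV, SETV and NOP */
    int dst;   /* shadow temp written; UIFU means dst = UifU(src, dst) */
};

static void toy_propagate_definedness(struct toy_uinstr *code, size_t n)
{
    bool defd[TOY_NTEMPS] = { false };   /* known fully defined? */

    for (size_t i = 0; i < n; i++) {
        struct toy_uinstr *u = &code[i];
        if (u->op == TOY_NOP)
            continue;
        assert(u->dst >= 0 && u->dst < TOY_NTEMPS);
        if (u->op == TOY_MOV || u->op == TOY_UIFU)
            assert(u->src >= 0 && u->src < TOY_NTEMPS);

        switch (u->op) {
        case TOY_GETV:                    /* value comes from outside */
            defd[u->dst] = false;
            break;
        case TOY_SETV:                    /* explicitly "all defined" */
            defd[u->dst] = true;
            break;
        case TOY_MOV:
            defd[u->dst] = defd[u->src];
            break;
        case TOY_UIFU:
            if (defd[u->src]) {
                u->op = TOY_NOP;          /* dst is left unchanged */
            } else if (defd[u->dst]) {
                u->op = TOY_MOV;          /* result is just src */
                defd[u->dst] = false;
            }
            /* otherwise neither is known, and dst stays unknown */
            break;
        case TOY_NOP:
            break;
        }
    }
}]]></programlisting>

<para>For the ADD-of-a-literal case, the literal's shadow is set by a
<computeroutput>SETV</computeroutput>, so the
<computeroutput>UifU</computeroutput> combining it into the variable's
shadow is deleted and the variable's definedness passes through
untouched.</para>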
| 1995 | |
| 1996 | <para><computeroutput>vg_delete_redundant_SETVs()</computeroutput> |
| 1997 | removes <computeroutput>SETV</computeroutput>s on shadow |
| 1998 | <computeroutput>TempReg</computeroutput>s for which the next |
| 1999 | action is a write. I don't think there's anything else worth |
| 2000 | saying about this; it is simple. Read the sources for |
| 2001 | details.</para> |
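
<para>As a rough illustration of the rule (again a hypothetical sketch
with invented <computeroutput>toy_*</computeroutput> names, not the
code in the sources), the following scans forward from each
<computeroutput>SETV</computeroutput> to find the next uinstr touching
the same shadow temp. It assumes that reaching the end of the block
counts the same as a write, which appears consistent with the deletion
reported at 41 in the listing below.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_op { TOY_NOP, TOY_SETV, TOY_TESTV, TOY_LOADV };

struct toy_uinstr {
    enum toy_op op;
    int temp;   /* the one shadow temp this toy uinstr touches */
};

/* In this toy set only TESTV reads; SETV and LOADV overwrite. */
static bool toy_reads(enum toy_op op)
{
    return op == TOY_TESTV;
}

/* Turns redundant SETVs into NOPs; returns how many were removed. */
static size_t toy_delete_redundant_setvs(struct toy_uinstr *code, size_t n)
{
    size_t deleted = 0;
    assert(code != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        bool read_first = false;
        if (code[i].op != TOY_SETV)
            continue;
        /* Find the next action on the same shadow temp, if any. */
        for (size_t j = i + 1; j < n; j++) {
            if (code[j].op == TOY_NOP || code[j].temp != code[i].temp)
                continue;
            read_first = toy_reads(code[j].op);
            break;
        }
        if (!read_first) {
            code[i].op = TOY_NOP;
            deleted++;
        }
    }
    return deleted;
}]]></programlisting>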
| 2002 | |
| 2003 | <para>So the cleaned-up running example looks like this. As |
| 2004 | above, I have inserted line breaks after every original |
| 2005 | (non-instrumentation) uinstr to aid readability. As with |
| 2006 | straightforward ucode optimisation, the results in this block are |
| 2007 | undramatic because it is so short; longer blocks benefit more |
| 2008 | because they have more redundancy which gets eliminated.</para> |
| 2009 | |
| 2010 | <programlisting><![CDATA[ |
| 2011 | at 29: delete UifU1 due to defd arg1 |
| 2012 | at 32: change ImproveAND1_TQ to MOV due to defd arg2 |
| 2013 | at 41: delete SETV |
| 2014 | at 31: delete MOV |
| 2015 | at 25: delete SETV |
| 2016 | at 22: delete SETV |
| 2017 | at 7: delete SETV |
| 2018 | |
| 2019 | 0: GETVL %EDX, q0 |
| 2020 | 1: GETL %EDX, t0 |
| 2021 | |
| 2022 | 2: TAG1o q0 = Left4 ( q0 ) |
| 2023 | 3: INCL t0 |
| 2024 | |
| 2025 | 4: PUTVL q0, %EDX |
| 2026 | 5: PUTL t0, %EDX |
| 2027 | |
| 2028 | 6: TESTVL q0 |
| 2029 | 8: LOADVB (t0), q0 |
| 2030 | 9: LDB (t0), t0 |
| 2031 | |
| 2032 | 10: TAG1o q0 = SWiden14 ( q0 ) |
| 2033 | 11: WIDENL_Bs t0 |
| 2034 | |
| 2035 | 12: PUTVL q0, %EAX |
| 2036 | 13: PUTL t0, %EAX |
| 2037 | |
| 2038 | 14: GETVL %ECX, q8 |
| 2039 | 15: GETL %ECX, t8 |
| 2040 | |
| 2041 | 16: MOVL q0, q4 |
| 2042 | 17: SHLL $0x1, q4 |
| 2043 | 18: TAG2o q4 = UifU4 ( q8, q4 ) |
| 2044 | 19: TAG1o q4 = Left4 ( q4 ) |
| 2045 | 20: LEA2L 1(t8,t0,2), t4 |
| 2046 | |
| 2047 | 21: TESTVL q4 |
| 2048 | 23: LOADVB (t4), q10 |
| 2049 | 24: LDB (t4), t10 |
| 2050 | |
| 2051 | 26: MOVB $0x20, t12 |
| 2052 | |
| 2053 | 27: MOVL q10, q14 |
| 2054 | 28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 ) |
| 2055 | 30: TAG2o q10 = DifD1 ( q14, q10 ) |
| 2056 | 32: MOVL t12, q14 |
| 2057 | 33: TAG2o q10 = DifD1 ( q14, q10 ) |
| 2058 | 34: MOVL q10, q16 |
| 2059 | 35: TAG1o q16 = PCast10 ( q16 ) |
| 2060 | 36: PUTVFo q16 |
| 2061 | 37: ANDB t12, t10 (-wOSZACP) |
| 2062 | |
| 2063 | 38: INCEIPo $9 |
| 2064 | 39: GETVFo q18 |
| 2065 | 40: TESTVo q18 |
| 2066 | 42: Jnzo $0x40435A50 (-rOSZACP) |
| 2067 | |
| 2068 | 43: JMPo $0x40435A5B]]></programlisting> |
| 2069 | |
| 2070 | </sect2> |
| 2071 | |
| 2072 | |
| 2073 | |
| 2074 | <sect2 id="mc-tech-docs.transfrom" xreflabel="Translation from UCode"> |
| 2075 | <title>Translation from UCode</title> |
| 2076 | |
| 2077 | <para>This is all very simple, even though |
| 2078 | <filename>vg_from_ucode.c</filename> is a big file. |
| 2079 | Position-independent x86 code is generated into a dynamically |
| 2080 | allocated array <computeroutput>emitted_code</computeroutput>; |
| 2081 | this is doubled in size when it overflows. Eventually the array |
| 2082 | is handed back to the caller of |
| 2083 | <computeroutput>VG_(translate)</computeroutput>, who must copy |
| 2084 | the result into TC and TT, and free the array.</para> |
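
<para>The doubling strategy is the usual one for growable arrays. A
minimal sketch follows; the struct, the function name, the initial
size of 16 bytes and the use of libc's
<computeroutput>realloc</computeroutput> are all assumptions made for
this example and are not taken from
<filename>vg_from_ucode.c</filename>.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct toy_code_buf {
    uint8_t *bytes;   /* emitted code so far           */
    size_t   used;    /* number of bytes emitted       */
    size_t   size;    /* current capacity of the array */
};

/* Append one byte, doubling the array when it is full. */
static void toy_emit_byte(struct toy_code_buf *buf, uint8_t b)
{
    if (buf->used == buf->size) {
        size_t new_size = buf->size == 0 ? 16 : buf->size * 2;
        uint8_t *grown = realloc(buf->bytes, new_size);
        assert(grown != NULL);   /* a real emitter would report this */
        buf->bytes = grown;
        buf->size = new_size;
    }
    buf->bytes[buf->used++] = b;
}]]></programlisting>

<para>Doubling keeps the total copying cost proportional to the final
size of the translation, however many bytes are emitted one at a
time.</para>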
| 2085 | |
| 2086 | <para>This file is structured into four layers of abstraction, |
| 2087 | which, thankfully, are glued back together with extensive |
| 2088 | <computeroutput>__inline__</computeroutput> directives. From the |
| 2089 | bottom upwards:</para> |
| 2090 | |
| 2091 | <itemizedlist> |
| 2092 | |
| 2093 | <listitem> |
| 2094 | <para>Address-mode emitters, |
| 2095 | <computeroutput>emit_amode_regmem_reg</computeroutput> et |
| 2096 | al.</para> |
| 2097 | </listitem> |
| 2098 | |
| 2099 | <listitem> |
| 2100 | <para>Emitters for specific x86 instructions. There are |
| 2101 | quite a lot of these, with names such as |
| 2102 | <computeroutput>emit_movv_offregmem_reg</computeroutput>. |
| 2103 | The <computeroutput>v</computeroutput> suffix is Intel |
| 2104 | parlance for a 16/32 bit insn; there are also |
| 2105 | <computeroutput>b</computeroutput> suffixes for 8 bit |
| 2106 | insns.</para> |
| 2107 | </listitem> |
| 2108 | |
| 2109 | <listitem> |
| 2110 | <para>The next level up are the |
| 2111 | <computeroutput>synth_*</computeroutput> functions, which |
| 2112 | synthesise possibly a sequence of raw x86 instructions to do |
| 2113 | some simple task. Some of these are quite complex because |
| 2114 | they have to work around Intel's silly restrictions on |
| 2115 | subregister naming. See |
| 2116 | <computeroutput>synth_nonshiftop_reg_reg</computeroutput> for |
| 2117 | example.</para> |
| 2118 | </listitem> |
| 2119 | |
| 2120 | <listitem> |
| 2121 | <para>Finally, at the top of the heap, we have |
| 2122 | <computeroutput>emitUInstr()</computeroutput>, which emits |
| 2123 | code for a single uinstr.</para> |
| 2124 | </listitem> |
| 2125 | |
| 2126 | </itemizedlist> |
| 2127 | |
| 2128 | <para>Some comments:</para> |
| 2129 | |
| 2130 | <itemizedlist> |
| 2131 | |
| 2132 | <listitem> |
| 2133 | <para>The hack for FPU instructions becomes apparent here. |
| 2134 | To do a <computeroutput>FPU</computeroutput> ucode |
| 2135 | instruction, we load the simulated FPU's state from |
| 2136 | <computeroutput>VG_(baseBlock)</computeroutput> into the real |
| 2137 | FPU using an x86 <computeroutput>frstor</computeroutput> |
| 2138 | insn, do the ucode <computeroutput>FPU</computeroutput> insn |
| 2139 | on the real CPU, and write the updated FPU state back into |
| 2140 | <computeroutput>VG_(baseBlock)</computeroutput> using an |
| 2141 | <computeroutput>fnsave</computeroutput> instruction. This is |
| 2142 | pretty brutal, but is simple and it works, and even seems |
| 2143 | tolerably efficient. There is no attempt to cache the |
| 2144 | simulated FPU state in the real FPU over multiple |
| 2145 | back-to-back ucode FPU instructions.</para> |
| 2146 | |
| 2147 | <para><computeroutput>FPU_R</computeroutput> and |
| 2148 | <computeroutput>FPU_W</computeroutput> are also done this |
| 2149 | way, with the minor complication that we need to patch in |
| 2150 | some addressing mode bits so the resulting insn knows the |
| 2151 | effective address to use. This is easy because of the |
| 2152 | regularity of the x86 FPU instruction encodings.</para> |
| 2153 | </listitem> |
| 2154 | |
| 2155 | <listitem> |
| 2156 | <para>An analogous trick is done with ucode insns which |
| 2157 | claim, in their <computeroutput>flags_r</computeroutput> and |
| 2158 | <computeroutput>flags_w</computeroutput> fields, that they |
| 2159 | read or write the simulated |
| 2160 | <computeroutput>%EFLAGS</computeroutput>. For such cases we |
| 2161 | first copy the simulated |
| 2162 | <computeroutput>%EFLAGS</computeroutput> into the real |
| 2163 | <computeroutput>%eflags</computeroutput>, then do the insn, |
| 2164 | then, if the insn says it writes the flags, copy back to |
| 2165 | <computeroutput>%EFLAGS</computeroutput>. This is a bit |
| 2166 | expensive, which is why the ucode optimisation pass goes to |
| 2167 | some effort to remove redundant flag-update annotations.</para> |
| 2168 | </listitem> |
| 2169 | |
| 2170 | </itemizedlist> |
| 2171 | |
| 2172 | <para>And so ... that's the end of the documentation for the |
| 2173 | instrumenting translator! It's really not that complex, |
| 2174 | because it's composed as a sequence of simple(ish) self-contained |
| 2175 | transformations on straight-line blocks of code.</para> |
| 2176 | |
| 2177 | </sect2> |
| 2178 | |
| 2179 | |
| 2180 | |
| 2181 | <sect2 id="mc-tech-docs.dispatch" xreflabel="Top-level dispatch loop"> |
| 2182 | <title>Top-level dispatch loop</title> |
| 2183 | |
| 2184 | <para>Urk. In <computeroutput>VG_(toploop)</computeroutput>. |
| 2185 | This is basically boring and unsurprising, not to mention fiddly |
| 2186 | and fragile. It needs to be cleaned up.</para> |
| 2187 | |
| 2188 | <para>Perhaps the only surprise is that the whole thing is run on |
| 2189 | top of a <computeroutput>setjmp</computeroutput>-installed |
| 2190 | exception handler, because, supposing a translation got a |
| 2191 | segfault, we have to bail out of the Valgrind-supplied exception |
| 2192 | handler <computeroutput>VG_(oursignalhandler)</computeroutput> |
| 2193 | and immediately start running the client's segfault handler, if |
| 2194 | it has one. In particular we can't finish the current basic |
| 2195 | block and then deliver the signal at some convenient future |
| 2196 | point, because signals like SIGILL, SIGSEGV and SIGBUS mean that |
| 2197 | the faulting insn should not simply be re-tried. (I'm sure there |
| 2198 | is a clearer way to explain this).</para> |
| 2199 | |
| 2200 | </sect2> |
| 2201 | |
| 2202 | |
| 2203 | |
| 2204 | <sect2 id="mc-tech-docs.lazy" |
| 2205 | xreflabel="Lazy updates of the simulated program counter"> |
| 2206 | <title>Lazy updates of the simulated program counter</title> |
| 2207 | |
| 2208 | <para>Simulated <computeroutput>%EIP</computeroutput> is not |
| 2209 | updated after every simulated x86 insn as this was regarded as |
| 2210 | too expensive. Instead ucode |
| 2211 | <computeroutput>INCEIP</computeroutput> insns move it along as |
| 2212 | and when necessary. Currently we don't allow it to fall more |
| 2213 | than 4 bytes behind reality (see |
| 2214 | <computeroutput>VG_(disBB)</computeroutput> for the way this |
| 2215 | works).</para> |
| 2216 | |
| 2217 | <para>Note that <computeroutput>%EIP</computeroutput> is always |
| 2218 | brought up to date by the inner dispatch loop in |
| 2219 | <computeroutput>VG_(dispatch)</computeroutput>, so that if the |
| 2220 | client takes a fault we know at least which basic block this |
| 2221 | happened in.</para> |
| 2222 | |
| 2223 | </sect2> |
| 2224 | |
| 2225 | |
| 2226 | |
| 2227 | <sect2 id="mc-tech-docs.signals" xreflabel="Signals"> |
| 2228 | <title>Signals</title> |
| 2229 | |
| 2230 | <para>Horrible, horrible. <filename>vg_signals.c</filename>. |
| 2231 | Basically, since we have to intercept all system calls anyway, we |
| 2232 | can see when the client tries to install a signal handler. If it |
| 2233 | does so, we make a note of what the client asked to happen, and |
| 2234 | ask the kernel to route the signal to our own signal handler, |
| 2235 | <computeroutput>VG_(oursignalhandler)</computeroutput>. This |
| 2236 | simply notes the delivery of signals, and returns.</para> |
| 2237 | |
| 2238 | <para>Every 1000 basic blocks, we see if more signals have |
| 2239 | arrived. If so, |
| 2240 | <computeroutput>VG_(deliver_signals)</computeroutput> builds |
| 2241 | signal delivery frames on the client's stack, and allows their |
| 2242 | handlers to be run. Valgrind places in these signal delivery |
| 2243 | frames a bogus return address, |
| 2244 | <computeroutput>VG_(signalreturn_bogusRA)</computeroutput>, and |
| 2245 | checks all jumps to see if any jump to it. A jump to it is a sign |
| 2246 | that a signal handler is returning, in which case Valgrind removes |
| 2247 | the relevant signal frame from the client's stack, restores |
| 2248 | from the signal frame the simulated state from before the signal was |
| 2249 | delivered, and allows the client to run onwards. We have to do |
| 2250 | it this way because some signal handlers never return, they just |
| 2251 | <computeroutput>longjmp()</computeroutput>, which nukes the |
| 2252 | signal delivery frame.</para> |
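
<para>The note-now, deliver-later structure can be sketched as follows.
This is a simplified model with invented names: the "handler" is called
directly rather than installed with the kernel, delivery is reduced to
counting, and the bogus-return-address mechanism is not modelled at
all.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <signal.h>

#define TOY_NSIG          32
#define TOY_POLL_INTERVAL 1000u

static volatile sig_atomic_t toy_pending[TOY_NSIG];
static unsigned toy_blocks_run;   /* basic blocks executed so far   */
static unsigned toy_delivered;    /* signals handed to the "client" */

/* Stand-in for the real handler: record the arrival and return. */
static void toy_note_signal(int signo)
{
    assert(signo >= 0 && signo < TOY_NSIG);
    toy_pending[signo] = 1;
}

/* Stand-in for delivery: here we only clear the note and count. */
static void toy_deliver_pending(void)
{
    for (int s = 0; s < TOY_NSIG; s++) {
        if (toy_pending[s]) {
            toy_pending[s] = 0;
            toy_delivered++;
        }
    }
}

/* Run nblocks basic blocks, polling every TOY_POLL_INTERVAL blocks. */
static void toy_run_blocks(unsigned nblocks)
{
    for (unsigned i = 0; i < nblocks; i++) {
        /* ... one translated basic block would run here ... */
        toy_blocks_run++;
        if (toy_blocks_run % TOY_POLL_INTERVAL == 0)
            toy_deliver_pending();
    }
}]]></programlisting>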
| 2253 | |
| 2254 | <para>The Linux kernel has a different but equally horrible hack |
| 2255 | for detecting signal handler returns. Discovering it is left as |
| 2256 | an exercise for the reader.</para> |
| 2257 | |
| 2258 | </sect2> |
| 2259 | |
| 2260 | |
| 2261 | <sect2 id="mc-tech-docs.todo"> |
| 2262 | <title>To be written</title> |
| 2263 | |
| 2264 | <para>The following is a list of as-yet-not-written stuff. Apologies.</para> |
| 2265 | <orderedlist> |
| 2266 | <listitem> |
| 2267 | <para>The translation cache and translation table</para> |
| 2268 | </listitem> |
| 2269 | <listitem> |
| 2270 | <para>Exceptions, creating new translations</para> |
| 2271 | </listitem> |
| 2272 | <listitem> |
| 2273 | <para>Self-modifying code</para> |
| 2274 | </listitem> |
| 2275 | <listitem> |
| 2276 | <para>Errors, error contexts, error reporting, suppressions</para> |
| 2277 | </listitem> |
| 2278 | <listitem> |
| 2279 | <para>Client malloc/free</para> |
| 2280 | </listitem> |
| 2281 | <listitem> |
| 2282 | <para>Low-level memory management</para> |
| 2283 | </listitem> |
| 2284 | <listitem> |
| 2285 | <para>A and V bitmaps</para> |
| 2286 | </listitem> |
| 2287 | <listitem> |
| 2288 | <para>Symbol table management</para> |
| 2289 | </listitem> |
| 2290 | <listitem> |
| 2291 | <para>Dealing with system calls</para> |
| 2292 | </listitem> |
| 2293 | <listitem> |
| 2294 | <para>Namespace management</para> |
| 2295 | </listitem> |
| 2296 | <listitem> |
| 2297 | <para>GDB attaching</para> |
| 2298 | </listitem> |
| 2299 | <listitem> |
| 2300 | <para>Non-dependence on glibc or anything else</para> |
| 2301 | </listitem> |
| 2302 | <listitem> |
| 2303 | <para>The leak detector</para> |
| 2304 | </listitem> |
| 2305 | <listitem> |
| 2306 | <para>Performance problems</para> |
| 2307 | </listitem> |
| 2308 | <listitem> |
| 2309 | <para>Continuous sanity checking</para> |
| 2310 | </listitem> |
| 2311 | <listitem> |
| 2312 | <para>Tracing, or not tracing, child processes</para> |
| 2313 | </listitem> |
| 2314 | <listitem> |
| 2315 | <para>Assembly glue for syscalls</para> |
| 2316 | </listitem> |
| 2317 | </orderedlist> |
| 2318 | |
| 2319 | </sect2> |
| 2320 | |
| 2321 | </sect1> |
| 2322 | |
| 2323 | |
| 2324 | |
| 2325 | |
| 2326 | <sect1 id="mc-tech-docs.extensions" xreflabel="Extensions"> |
| 2327 | <title>Extensions</title> |
| 2328 | |
| 2329 | <para>Some comments about Stuff To Do.</para> |
| 2330 | |
| 2331 | <sect2 id="mc-tech-docs.bugs" xreflabel="Bugs"> |
| 2332 | <title>Bugs</title> |
| 2333 | |
| 2334 | <para>Stephan Kulow and Marc Mutz report problems with kmail in |
| 2335 | KDE 3 CVS (RC2 ish) when run on Valgrind. Stephan has it |
| 2336 | deadlocking; Marc has it looping at startup. I can't repro |
| 2337 | either behaviour. Needs repro-ing and fixing.</para> |
| 2338 | |
| 2339 | </sect2> |
| 2340 | |
| 2341 | |
| 2342 | <sect2 id="mc-tech-docs.threads" xreflabel="Threads"> |
| 2343 | <title>Threads</title> |
| 2344 | |
| 2345 | <para>Doing a good job of thread support strikes me as almost a |
| 2346 | research-level problem. The central issues are how to do fast |
| 2347 | cheap locking of the |
| 2348 | <computeroutput>VG_(primary_map)</computeroutput> structure, |
| 2349 | whether or not accesses to the individual secondary maps need |
| 2350 | locking, what race-condition issues result, and whether the |
| 2351 | already-nasty mess that is the signal simulator needs further |
| 2352 | hackery.</para> |
| 2353 | |
| 2354 | <para>I realise that threads are the most-frequently-requested |
| 2355 | feature, and I am thinking about it all. If you have guru-level |
| 2356 | understanding of fast mutual exclusion mechanisms and race |
| 2357 | conditions, I would be interested in hearing from you.</para> |
| 2358 | |
| 2359 | </sect2> |
| 2360 | |
| 2361 | |
| 2362 | |
| 2363 | <sect2 id="mc-tech-docs.verify" xreflabel="Verification suite"> |
| 2364 | <title>Verification suite</title> |
| 2365 | |
| 2366 | <para>Directory <computeroutput>tests/</computeroutput> contains |
| 2367 | various ad-hoc tests for Valgrind. However, there is no |
| 2368 | systematic verification or regression suite, that, for example, |
| 2369 | exercises all the stuff in <filename>vg_memory.c</filename>, to |
| 2370 | ensure that illegal memory accesses and undefined value uses are |
| 2371 | detected as they should be. It would be good to have such a |
| 2372 | suite.</para> |
| 2373 | |
| 2374 | </sect2> |
| 2375 | |
| 2376 | |
| 2377 | <sect2 id="mc-tech-docs.porting" xreflabel="Porting to other platforms"> |
| 2378 | <title>Porting to other platforms</title> |
| 2379 | |
| 2380 | <para>It would be great if Valgrind were ported to FreeBSD and x86 |
| 2381 | NetBSD, and to x86 OpenBSD, if it's possible (doesn't OpenBSD use |
| 2382 | a.out-style executables, not ELF?)</para> |
| 2383 | |
| 2384 | <para>The main difficulties, for an x86-ELF platform, seem to |
| 2385 | be:</para> |
| 2386 | |
| 2387 | <itemizedlist> |
| 2388 | |
| 2389 | <listitem> |
| 2390 | <para>You'd need to rewrite the |
| 2391 | <computeroutput>/proc/self/maps</computeroutput> parser |
| 2392 | (<filename>vg_procselfmaps.c</filename>). Easy.</para> |
| 2393 | </listitem> |
| 2394 | |
| 2395 | <listitem> |
| 2396 | <para>You'd need to rewrite |
| 2397 | <filename>vg_syscall_mem.c</filename>, or, more specifically, |
| 2398 | provide one for your OS. This is tedious, but you can |
| 2399 | implement syscalls on demand, and the Linux kernel interface |
| 2400 | is, for the most part, going to look very similar to the *BSD |
| 2401 | interfaces, so it's really a copy-paste-and-modify-on-demand |
| 2402 | job. As part of this, you'd need to supply a new |
| 2403 | <filename>vg_kerneliface.h</filename> file.</para> |
| 2404 | </listitem> |
| 2405 | |
| 2406 | <listitem> |
| 2407 | <para>You'd also need to change the syscall wrappers for |
| 2408 | Valgrind's internal use, in |
| 2409 | <filename>vg_mylibc.c</filename>.</para> |
| 2410 | </listitem> |
| 2411 | |
| 2412 | </itemizedlist> |
| 2413 | |
| 2414 | <para>All in all, I think a port to x86-ELF *BSDs is not really |
| 2415 | very difficult, and in some ways I would like to see it happen, |
| 2416 | because that would force a more clear factoring of Valgrind into |
| 2417 | platform dependent and independent pieces. Not to mention, *BSD |
| 2418 | folks also deserve to use Valgrind just as much as the Linux crew |
| 2419 | do.</para> |
| 2420 | |
| 2421 | </sect2> |
| 2422 | |
| 2423 | </sect1> |
| 2424 | |
| 2425 | |
| 2426 | |
| 2427 | <sect1 id="mc-tech-docs.easystuff" |
| 2428 | xreflabel="Easy stuff which ought to be done"> |
| 2429 | <title>Easy stuff which ought to be done</title> |
| 2430 | |
| 2431 | |
| 2432 | <sect2 id="mc-tech-docs.mmx" xreflabel="MMX Instructions"> |
| 2433 | <title>MMX Instructions</title> |
| 2434 | |
| 2435 | <para>MMX insns should be supported, using the same trick as for |
| 2436 | FPU insns. If the MMX registers are not used to copy |
| 2437 | uninitialised junk from one place to another in memory, this |
| 2438 | means we don't have to actually simulate the internal MMX unit |
| 2439 | state, so the FPU hack applies. This should be fairly |
| 2440 | easy.</para> |
| 2441 | |
| 2442 | </sect2> |
| 2443 | |
| 2444 | |
| 2445 | <sect2 id="mc-tech-docs.fixstabs" xreflabel="Fix stabs-info Reader"> |
| 2446 | <title>Fix stabs-info reader</title> |
| 2447 | |
| 2448 | <para>The machinery in <filename>vg_symtab2.c</filename> which |
| 2449 | reads "stabs" style debugging info is pretty weak. It usually |
| 2450 | correctly translates simulated program counter values into line |
| 2451 | numbers and procedure names, but the file name is often |
| 2452 | completely wrong. I think the logic used to parse "stabs" |
| 2453 | entries is weak. It should be fixed. The simplest solution, |
| 2454 | IMO, is to copy either the logic or simply the code out of GNU |
| 2455 | binutils which does this; since GDB can clearly get it right, |
| 2456 | binutils (or GDB?) must have code to do this somewhere.</para> |
| 2457 | |
| 2458 | </sect2> |
| 2459 | |
| 2460 | |
| 2461 | |
| 2462 | <sect2 id="mc-tech-docs.x86instr" xreflabel="BT/BTC/BTS/BTR"> |
| 2463 | <title>BT/BTC/BTS/BTR</title> |
| 2464 | |
| 2465 | <para>These are x86 instructions which test, complement, set, or |
| 2466 | reset, a single bit in a word. At the moment they are both |
| 2467 | incorrectly implemented and incorrectly instrumented.</para> |
| 2468 | |
| 2469 | <para>The incorrect instrumentation is due to use of helper |
| 2470 | functions. This means we lose bit-level definedness tracking, |
| 2471 | which could wind up giving spurious uninitialised-value use |
| 2472 | errors. The Right Thing to do is to invent a couple of new |
| 2473 | UOpcodes, I think <computeroutput>GET_BIT</computeroutput> and |
| 2474 | <computeroutput>SET_BIT</computeroutput>, which can be used to |
| 2475 | implement all 4 x86 insns, get rid of the helpers, and give |
| 2476 | bit-accurate instrumentation rules for the two new |
| 2477 | UOpcodes.</para> |
| 2478 | |
| 2479 | <para>I realised the other day that they are mis-implemented too. |
| 2480 | The x86 insns take a bit-index and a register or memory location |
| 2481 | to access. For registers the bit index clearly can only be in |
| 2482 | the range zero to register-width minus 1, and I assumed the same |
| 2483 | applied to memory locations too. But evidently not; for memory |
| 2484 | locations the index can be arbitrary, and the processor will |
| 2485 | index arbitrarily into memory as a result. This too should be |
| 2486 | fixed. Sigh. Presumably indexing outside the immediate word is |
| 2487 | not actually used by any programs yet tested on Valgrind, for |
| 2488 | otherwise they (presumably) would simply not work at all. If you |
| 2489 | plan to hack on this, first check the Intel docs to make sure my |
| 2490 | understanding is really correct.</para> |
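
<para>For anyone checking that understanding, here is a small C model
of what a bit-test with an arbitrary index into memory would have to
compute. It is a sketch under assumptions that should be confirmed
against the Intel documentation: the index is treated as a signed
32-bit quantity, bit 0 is the least significant bit of the byte at the
base address, and negative indices select bytes below the base. The
function name is invented.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>

/* Return the bit at signed bit_index relative to base.  The caller
   must make sure the selected byte is actually accessible. */
static int toy_bt_mem(const uint8_t *base, int32_t bit_index)
{
    /* C division truncates towards zero, so fix up negative indices
       to get floor division and a bit offset in 0..7. */
    int32_t byte_off = bit_index / 8;
    int32_t bit_off  = bit_index % 8;
    if (bit_off < 0) {
        bit_off  += 8;
        byte_off -= 1;
    }
    assert(bit_off >= 0 && bit_off < 8);
    return (base[byte_off] >> bit_off) & 1;
}]]></programlisting>

<para>An index of 42 therefore lands on bit 2 of the byte five bytes
above the base, well outside the immediate 32-bit word.</para>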
| 2491 | |
| 2492 | </sect2> |
| 2493 | |
| 2494 | |
| 2495 | <sect2 id="mc-tech-docs.prefetch" xreflabel="Using PREFETCH Instructions"> |
| 2496 | <title>Using PREFETCH Instructions</title> |
| 2497 | |
| 2498 | <para>Here's a small but potentially interesting project for |
| 2499 | performance junkies. Experiments with Valgrind's code generator |
| 2500 | and optimiser(s) suggest that reducing the number of instructions |
| 2501 | executed in the translations and mem-check helpers gives |
| 2502 | disappointingly small performance improvements. Perhaps this is |
| 2503 | because performance of Valgrindified code is limited by cache |
| 2504 | misses. After all, each read in the original program now gives |
| 2505 | rise to at least three reads, one for the |
| 2506 | <computeroutput>VG_(primary_map)</computeroutput>, one of the |
| 2507 | resulting secondary, and the original. Not to mention, the |
| 2508 | instrumented translations are 13 to 14 times larger than the |
| 2509 | originals. All in all one would expect the memory system to be |
| 2510 | hammered to hell and then some.</para> |

<para>So here's an idea.  An x86 insn involving a read from
memory, after instrumentation, will turn into ucode of the
following form:</para>
<programlisting><![CDATA[
  ... calculate effective addr, into ta and qa ...
  TESTVL qa             -- is the addr defined?
  LOADV (ta), qloaded   -- fetch V bits for the addr
  LOAD  (ta), tloaded   -- do the original load]]></programlisting>

<para>At the point where the
<computeroutput>LOADV</computeroutput> is done, we know the
actual address (<computeroutput>ta</computeroutput>) from which
the real <computeroutput>LOAD</computeroutput> will be done.  We
also know that the <computeroutput>LOADV</computeroutput> will
take around 20 x86 insns to do.  So it seems plausible that doing
a prefetch of <computeroutput>ta</computeroutput> just before the
<computeroutput>LOADV</computeroutput> might just avoid a miss at
the <computeroutput>LOAD</computeroutput> point, and that might
be a significant performance win.</para>
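
<para>The ordering can be sketched in C as follows.  This is a toy
model, not Valgrind code: <computeroutput>SecMap</computeroutput>,
<computeroutput>primary_map</computeroutput>,
<computeroutput>loadv</computeroutput> and
<computeroutput>load_checked</computeroutput> are hypothetical
stand-ins, the client address space is a plain byte array starting
at address 0, and unmapped secondaries are assumed to read as
all-undefined.  The prefetch uses
<computeroutput>__builtin_prefetch</computeroutput>, which is
specific to gcc and compatible compilers.</para>
<programlisting><![CDATA[
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical two-level shadow map standing in for VG_(primary_map):
   the top 16 address bits pick a secondary, the low 16 bits pick a
   V byte inside it. */
typedef struct { uint8_t vbyte[65536]; } SecMap;
static SecMap* primary_map[65536];

/* Stand-in for LOADV: fetch the V byte for client address a.
   Assumption: an unmapped secondary reads as all-undefined (0xFF). */
static uint8_t loadv ( uint32_t a )
{
   SecMap* sm = primary_map[a >> 16];
   return sm ? sm->vbyte[a & 0xFFFF] : 0xFF;
}

/* Instrumented 32-bit load from client address a. */
static uint32_t load_checked ( const uint8_t* client_mem, uint32_t a,
                               uint8_t* vbits )
{
   const uint8_t* ta = client_mem + a;
   uint32_t tloaded;
   __builtin_prefetch(ta, 0 /* read */, 3 /* high temporal locality */);
   *vbits = loadv(a);                     /* shadow lookup happens first */
   memcpy(&tloaded, ta, sizeof tloaded);  /* then the original LOAD */
   return tloaded;
}
]]></programlisting>
<para>The prefetch is only a hint, so the result of
<computeroutput>load_checked</computeroutput> is the same with or
without it; whether it helps is a matter for measurement.</para>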

<para>Prefetch insns are notoriously temperamental, more often
than not making things worse rather than better, so this would
require considerable fiddling around.  It's complicated because
Intel and AMD processors have different prefetch insns with
different semantics, so that too needs to be taken into account.
As a general rule, even placing the prefetches before the
<computeroutput>LOADV</computeroutput> insn is too near the
<computeroutput>LOAD</computeroutput>; the ideal distance is
apparently circa 200 CPU cycles.  So it might be worth having
another analysis/transformation pass which pushes prefetches as
far back as possible, hopefully immediately after the effective
address becomes available.</para>

<para>Doing too many prefetches is also bad because they soak up
bus bandwidth / cpu resources, so some cleverness in deciding
which loads to prefetch and which to not might be helpful.  One
can imagine not prefetching client-stack-relative
(<computeroutput>%EBP</computeroutput> or
<computeroutput>%ESP</computeroutput>) accesses, since the stack
in general tends to show good locality anyway.</para>

<para>There's quite a lot of experimentation to do here, but I
think it might make an interesting week's work for
someone.</para>

<para>As of 15-ish March 2002, I've started to experiment with
this, using the AMD
<computeroutput>prefetch/prefetchw</computeroutput> insns.</para>

</sect2>


<sect2 id="mc-tech-docs.pranges" xreflabel="User-defined Permission Ranges">
<title>User-defined Permission Ranges</title>
<para>This is quite a large project -- perhaps a month's hacking
for a capable hacker to do a good job -- but it's potentially
very interesting.  The outcome would be that Valgrind could
detect a whole class of bugs which it currently cannot.</para>

<para>The presentation falls into two pieces.</para>

<sect3 id="mc-tech-docs.psetting"
       xreflabel="Part 1: User-defined Address-range Permission Setting">
<title>Part 1: User-defined Address-range Permission Setting</title>

<para>Valgrind intercepts the client's
<computeroutput>malloc</computeroutput>,
<computeroutput>free</computeroutput>, etc. calls, watches system
calls, and watches the stack pointer move.  This is currently the
only way it knows which addresses are valid and which are not.
Sometimes the client program knows extra information about its
memory areas.  For example, the client could at some point know
that all elements of an array are out-of-date.  We would like to
be able to convey to Valgrind this information that the array is
now addressable-but-uninitialised, so that Valgrind can then warn
if elements are used before they get new values.</para>

<para>What I would like are some macros like this:</para>
<programlisting><![CDATA[
  VALGRIND_MAKE_NOACCESS(addr, len)
  VALGRIND_MAKE_WRITABLE(addr, len)
  VALGRIND_MAKE_READABLE(addr, len)]]></programlisting>

<para>and also, to check that memory is
addressable/initialised,</para>
<programlisting><![CDATA[
  VALGRIND_CHECK_ADDRESSABLE(addr, len)
  VALGRIND_CHECK_INITIALISED(addr, len)]]></programlisting>

<para>I then include in my sources a header defining these
macros, rebuild my app, run under Valgrind, and get user-defined
checks.</para>

<para>Now here's a neat trick.  It's a nuisance to have to
re-link the app with some new library which implements the above
macros.  So the idea is to define the macros so that the
resulting executable is still completely stand-alone, and can be
run without Valgrind, in which case the macros do nothing, but
when run on Valgrind, the Right Thing happens.  How to do this?
The idea is for these macros to turn into a piece of inline
assembly code, which (1) has no effect when run on the real CPU,
(2) is easily spotted by Valgrind's JITter, and (3) no sane
person would ever write, which is important for avoiding false
matches in (2).  So here's a suggestion:</para>
<programlisting><![CDATA[
  VALGRIND_MAKE_NOACCESS(addr, len)]]></programlisting>

<para>becomes (roughly speaking)</para>
<programlisting><![CDATA[
  movl addr, %eax
  movl len,  %ebx
  movl $1,   %ecx   -- 1 describes the action; MAKE_WRITABLE might be
                    -- 2, etc
  rorl $13, %ecx
  rorl $19, %ecx
  rorl $11, %eax
  rorl $21, %eax]]></programlisting>

<para>The rotate sequences have no effect, and it's unlikely they
would appear for any other reason, but they define a unique
byte-sequence which the JITter can easily spot.  Using the
operand constraints section at the end of a gcc inline-assembly
statement, we can tell gcc that the assembly fragment kills
<computeroutput>%eax</computeroutput>,
<computeroutput>%ebx</computeroutput>,
<computeroutput>%ecx</computeroutput> and the condition codes, so
this fragment is made harmless when not running on Valgrind, runs
quickly when not on Valgrind, and does not require any other
library support.</para>
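
<para>A minimal sketch of how such a macro might be written with
gcc extended inline assembly is given below.  The function
<computeroutput>client_request</computeroutput> and the action
codes are hypothetical, chosen only for illustration.  Instead of
listing the registers as clobbered, the sketch passes them as
operands tied to <computeroutput>%eax</computeroutput>,
<computeroutput>%ecx</computeroutput> and
<computeroutput>%ebx</computeroutput>, which lets gcc do the initial
moves itself.  Like the fragment above it handles 32-bit values
only, so on a 64-bit host the address is truncated.</para>
<programlisting><![CDATA[
#include <stdint.h>

/* Hypothetical action codes; the text only suggests 1 for
   MAKE_NOACCESS and "perhaps 2" for MAKE_WRITABLE. */
enum { ACT_MAKE_NOACCESS = 1, ACT_MAKE_WRITABLE = 2 };

/* Emit the marker sequence.  Returns 1 if the registers came back
   unchanged, which is what must happen on a real CPU. */
static int client_request ( uint32_t addr, uint32_t len, uint32_t action )
{
   uint32_t a = addr, c = action;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
   __asm__ volatile (
      "rorl $13, %1\n\t"
      "rorl $19, %1\n\t"     /* 13 + 19 = 32, so %ecx is restored */
      "rorl $11, %0\n\t"
      "rorl $21, %0"         /* 11 + 21 = 32, so %eax is restored */
      : "+a" (a), "+c" (c)   /* %eax = addr, %ecx = action */
      : "b" (len)            /* %ebx = len, read only */
      : "cc" );              /* the rotates change the condition codes */
#else
   (void)len;                /* other targets: the request is a no-op */
#endif
   return a == addr && c == action;
}

#define VALGRIND_MAKE_NOACCESS(addr, len)                  \
   client_request((uint32_t)(uintptr_t)(addr),             \
                  (uint32_t)(len), ACT_MAKE_NOACCESS)
]]></programlisting>
<para>On a real CPU each pair of rotates adds up to a full 32-bit
rotation, so the call changes nothing and needs no library
support.</para>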


</sect3>


<sect3 id="mc-tech-docs.prange-detect"
       xreflabel="Part 2: Using it to detect Interference between Stack
Variables">
<title>Part 2: Using it to detect Interference between Stack
Variables</title>

<para>Currently Valgrind cannot detect errors of the following
form:</para>
<programlisting><![CDATA[
void fooble ( void )
{
   int a[10];
   int b[10];
   a[10] = 99;
}]]></programlisting>

<para>Now imagine rewriting this as</para>
<programlisting><![CDATA[
void fooble ( void )
{
   int spacer0;
   int a[10];
   int spacer1;
   int b[10];
   int spacer2;
   VALGRIND_MAKE_NOACCESS(&spacer0, sizeof(int));
   VALGRIND_MAKE_NOACCESS(&spacer1, sizeof(int));
   VALGRIND_MAKE_NOACCESS(&spacer2, sizeof(int));
   a[10] = 99;
}]]></programlisting>

<para>Now the invalid write is certain to hit
<computeroutput>spacer0</computeroutput> or
<computeroutput>spacer1</computeroutput>, so Valgrind will spot
the error.</para>

<para>There are two complications.</para>

<orderedlist>

  <listitem>
    <para>The first is that we don't want to annotate sources by
    hand, so the Right Thing to do is to write a C/C++ parser,
    annotator and prettyprinter which does this automatically, and
    run it on post-CPP'd C/C++ source.  The parser/prettyprinter
    is probably not as hard as it sounds; I would write it in Haskell,
    a powerful functional language well suited to doing symbolic
    computation, with which I am intimately familiar.  There is
    already a C parser written in Haskell by someone in the
    Haskell community, and that would probably be a good starting
    point.</para>
  </listitem>


  <listitem>
    <para>The second complication is how to get rid of these
    <computeroutput>NOACCESS</computeroutput> records inside
    Valgrind when the instrumented function exits; after all,
    these refer to stack addresses and will make no sense
    whatever when some other function happens to re-use the same
    stack address range, probably shortly afterwards.  I think I
    would be inclined to define a special stack-specific
    macro:</para>
<programlisting><![CDATA[
  VALGRIND_MAKE_NOACCESS_STACK(addr, len)]]></programlisting>
    <para>which causes Valgrind to record the client's
    <computeroutput>%ESP</computeroutput> at the time it is
    executed.  Valgrind will then watch for changes in
    <computeroutput>%ESP</computeroutput> and discard such
    records as soon as the protected area is uncovered by an
    increase in <computeroutput>%ESP</computeroutput>.  I
    hesitate with this scheme only because it is potentially
    expensive, if there are hundreds of such records, and
    considering that changes in
    <computeroutput>%ESP</computeroutput> already require
    expensive messing with stack access permissions.</para>
  </listitem>
</orderedlist>

<para>This is probably easier and more robust than for the
instrumenter program to try and spot all exit points for the
procedure and place suitable deallocation annotations there.
Plus C++ procedures can bomb out at any point if they get an
exception, so spotting return points at the source level just
won't work at all.</para>
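
<para>The record-and-discard scheme from the second complication
might look roughly like this.  Everything here is hypothetical: the
names <computeroutput>NoAccessRec</computeroutput>,
<computeroutput>record_noaccess_stack</computeroutput> and
<computeroutput>on_esp_change</computeroutput>, the fixed-size
table, and the choice to key the discard test on the start of the
protected area rather than on a recorded
<computeroutput>%ESP</computeroutput> value.  The sketch assumes a
downward-growing stack.</para>
<programlisting><![CDATA[
#include <stdint.h>

#define MAX_RECS 64

typedef struct { uint32_t addr, len; } NoAccessRec;
static NoAccessRec recs[MAX_RECS];
static int n_recs = 0;

/* VALGRIND_MAKE_NOACCESS_STACK(addr, len) would end up here.
   Returns 0 if the table is full, 1 otherwise. */
static int record_noaccess_stack ( uint32_t addr, uint32_t len )
{
   if (n_recs == MAX_RECS) return 0;
   recs[n_recs].addr = addr;
   recs[n_recs].len  = len;
   n_recs++;
   return 1;
}

/* Called on every change of the client's %ESP.  Anything below
   new_esp is no longer part of the live stack, so drop each record
   whose area starts below it.  Returns the number dropped. */
static int on_esp_change ( uint32_t new_esp )
{
   int i, kept = 0, dropped;
   for (i = 0; i < n_recs; i++) {
      if (recs[i].addr >= new_esp)
         recs[kept++] = recs[i];     /* still inside the live stack */
   }
   dropped = n_recs - kept;
   n_recs  = kept;
   return dropped;
}
]]></programlisting>
<para>The linear scan on every stack-pointer change is exactly the
cost worried about above; with hundreds of live records a sorted
structure would probably be needed.</para>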

<para>Although it is some work, it's all eminently doable, and it
would make Valgrind into an even-more-useful tool.</para>

</sect3>

</sect2>

</sect1>
</chapter>