<?xml version="1.0"?> <!-- -*- sgml -*- -->
2<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
3 "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
4
5<chapter id="mc-tech-docs"
6 xreflabel="The design and implementation of Valgrind">
7
8<title>The Design and Implementation of Valgrind</title>
9<subtitle>Detailed technical notes for hackers, maintainers and
10 the overly-curious</subtitle>
11
12<sect1 id="mc-tech-docs.intro" xreflabel="Introduction">
13<title>Introduction</title>
14
15<para>This document contains a detailed, highly-technical
16description of the internals of Valgrind. This is not the user
17manual; if you are an end-user of Valgrind, you do not want to
18read this. Conversely, if you really are a hacker-type and want
19to know how it works, I assume that you have read the user manual
20thoroughly.</para>
21
22<para>You may need to read this document several times, and
23carefully. Some important things, I only say once.</para>
24
25
26
27
28<sect2 id="mc-tech-docs.history" xreflabel="History">
29<title>History</title>
30
31<para>Valgrind came into public view in late Feb 2002. However,
32it has been under contemplation for a very long time, perhaps
33seriously for about five years. Somewhat over two years ago, I
34started working on the x86 code generator for the Glasgow Haskell
35Compiler (http://www.haskell.org/ghc), gaining familiarity with
36x86 internals on the way. I then did Cacheprof
37(http://www.cacheprof.org), gaining further x86 experience. Some
38time around Feb 2000 I started experimenting with a user-space
39x86 interpreter for x86-Linux. This worked, but it was clear
40that a JIT-based scheme would be necessary to give reasonable
41performance for Valgrind. Design work for the JITter started in
42earnest in Oct 2000, and by early 2001 I had an x86-to-x86
43dynamic translator which could run quite large programs. This
44translator was in a sense pointless, since it did not do any
45instrumentation or checking.</para>
46
47<para>Most of the rest of 2001 was taken up designing and
48implementing the instrumentation scheme. The main difficulty,
49which consumed a lot of effort, was to design a scheme which did
50not generate large numbers of false uninitialised-value warnings.
51By late 2001 a satisfactory scheme had been arrived at, and I
52started to test it on ever-larger programs, with an eventual eye
53to making it work well enough so that it was helpful to folks
54debugging the upcoming version 3 of KDE. I've used KDE since
before version 1.0, and wanted Valgrind to be an indirect
56contribution to the KDE 3 development effort. At the start of
57Feb 02 the kde-core-devel crew started using it, and gave a huge
58amount of helpful feedback and patches in the space of three
59weeks. Snapshot 20020306 is the result.</para>
60
61<para>In the best Unix tradition, or perhaps in the spirit of
62Fred Brooks' depressing-but-completely-accurate epitaph "build
63one to throw away; you will anyway", much of Valgrind is a second
64or third rendition of the initial idea. The instrumentation
65machinery (<filename>vg_translate.c</filename>,
66<filename>vg_memory.c</filename>) and core CPU simulation
67(<filename>vg_to_ucode.c</filename>,
68<filename>vg_from_ucode.c</filename>) have had three redesigns
69and rewrites; the register allocator, low-level memory manager
70(<filename>vg_malloc2.c</filename>) and symbol table reader
71(<filename>vg_symtab2.c</filename>) are on the second rewrite.
72In a sense, this document serves to record some of the knowledge
73gained as a result.</para>
74
75</sect2>
76
77
78<sect2 id="mc-tech-docs.overview" xreflabel="Design overview">
79<title>Design overview</title>
80
81<para>Valgrind is compiled into a Linux shared object,
82<filename>valgrind.so</filename>, and also a dummy one,
83<filename>valgrinq.so</filename>, of which more later. The
84<filename>valgrind</filename> shell script adds
85<filename>valgrind.so</filename> to the
86<computeroutput>LD_PRELOAD</computeroutput> list of extra
libraries to be loaded with any dynamically linked program. This
88is a standard trick, one which I assume the
89<computeroutput>LD_PRELOAD</computeroutput> mechanism was
90developed to support.</para>
91
92<para><filename>valgrind.so</filename> is linked with the
93<computeroutput>-z initfirst</computeroutput> flag, which
94requests that its initialisation code is run before that of any
95other object in the executable image. When this happens,
Valgrind gains control. The real CPU becomes "trapped" in
97<filename>valgrind.so</filename> and the translations it
98generates. The synthetic CPU provided by Valgrind does, however,
99return from this initialisation function. So the normal startup
100actions, orchestrated by the dynamic linker
101<filename>ld.so</filename>, continue as usual, except on the
102synthetic CPU, not the real one. Eventually
103<computeroutput>main</computeroutput> is run and returns, and
104then the finalisation code of the shared objects is run,
presumably in the reverse of the order in which they were initialised.
106Remember, this is still all happening on the simulated CPU.
107Eventually <filename>valgrind.so</filename>'s own finalisation
108code is called. It spots this event, shuts down the simulated
109CPU, prints any error summaries and/or does leak detection, and
110returns from the initialisation code on the real CPU. At this
111point, in effect the real and synthetic CPUs have merged back
112into one, Valgrind has lost control of the program, and the
113program finally <computeroutput>exit()s</computeroutput> back to
114the kernel in the usual way.</para>
115
116<para>The normal course of activity, once Valgrind has started
117up, is as follows. Valgrind never runs any part of your program
118(usually referred to as the "client"), not a single byte of it,
119directly. Instead it uses function
120<computeroutput>VG_(translate)</computeroutput> to translate
121basic blocks (BBs, straight-line sequences of code) into
122instrumented translations, and those are run instead. The
123translations are stored in the translation cache (TC),
124<computeroutput>vg_tc</computeroutput>, with the translation
125table (TT), <computeroutput>vg_tt</computeroutput> supplying the
126original-to-translation code address mapping. Auxiliary array
127<computeroutput>VG_(tt_fast)</computeroutput> is used as a
128direct-map cache for fast lookups in TT; it usually achieves a
129hit rate of around 98% and facilitates an orig-to-trans lookup in
1304 x86 insns, which is not bad.</para>
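<para>As a rough illustration of the direct-map idea, the following
minimal sketch models a fast cache in which each original address maps
to exactly one slot. It is not Valgrind's actual code: the table size,
the <computeroutput>FastEntry</computeroutput> layout and the function
names are all assumptions made for the example.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size; must be a power of two so masking selects a slot. */
#define FAST_CACHE_SIZE 1024u

typedef struct {
    uint32_t    orig_addr;  /* original (client) code address        */
    const void *trans;      /* its translation, or NULL if slot empty */
} FastEntry;

static FastEntry fast_cache[FAST_CACHE_SIZE];

/* Each address maps to exactly one slot: its low-order bits. */
static size_t fast_slot(uint32_t orig_addr)
{
    return orig_addr & (FAST_CACHE_SIZE - 1u);
}

/* Record a mapping, evicting whatever previously occupied the slot. */
static void fast_insert(uint32_t orig_addr, const void *trans)
{
    FastEntry *e = &fast_cache[fast_slot(orig_addr)];
    e->orig_addr = orig_addr;
    e->trans     = trans;
}

/* Return the translation on a hit, NULL on a miss.  On a miss the
   caller would fall back to a search of the full translation table. */
static const void *fast_lookup(uint32_t orig_addr)
{
    const FastEntry *e = &fast_cache[fast_slot(orig_addr)];
    return (e->trans != NULL && e->orig_addr == orig_addr) ? e->trans : NULL;
}
]]></programlisting>

<para>A hit costs one mask, one index and one compare, which is why a
lookup of this shape can be done in a handful of instructions. Two
addresses that share a slot simply evict each other, so the full table
remains the authority.</para>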
131
132<para>Function <computeroutput>VG_(dispatch)</computeroutput> in
133<filename>vg_dispatch.S</filename> is the heart of the JIT
134dispatcher. Once a translated code address has been found, it is
135executed simply by an x86 <computeroutput>call</computeroutput>
136to the translation. At the end of the translation, the next
137original code addr is loaded into
138<computeroutput>%eax</computeroutput>, and the translation then
139does a <computeroutput>ret</computeroutput>, taking it back to
140the dispatch loop, with, interestingly, zero branch
141mispredictions. The address requested in
142<computeroutput>%eax</computeroutput> is looked up first in
143<computeroutput>VG_(tt_fast)</computeroutput>, and, if not found,
144by calling C helper
145<computeroutput>VG_(search_transtab)</computeroutput>. If there
146is still no translation available,
147<computeroutput>VG_(dispatch)</computeroutput> exits back to the
148top-level C dispatcher
149<computeroutput>VG_(toploop)</computeroutput>, which arranges for
150<computeroutput>VG_(translate)</computeroutput> to make a new
151translation. All fairly unsurprising, really. There are various
152complexities described below.</para>
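<para>The control flow of the dispatch loop can be sketched in C as
follows. This is only a model under stated assumptions: the real
dispatcher is written in assembly, and here a translation is
represented by a C function whose return value stands in for the
address left in <computeroutput>%eax</computeroutput>. The names
<computeroutput>demo_tt</computeroutput>,
<computeroutput>bb_at_1000</computeroutput> and
<computeroutput>bb_at_2000</computeroutput> are hypothetical, and the
fast cache is omitted for brevity.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A "translation" runs one basic block and returns the original-code
   address of the next block to run. */
typedef uint32_t (*Translation)(void);

typedef struct {
    uint32_t    orig;   /* original code address */
    Translation trans;  /* its translation       */
} TTEntry;

static int blocks_run = 0;  /* demo bookkeeping only */

static uint32_t bb_at_1000(void) { blocks_run++; return 0x2000u; }
static uint32_t bb_at_2000(void) { blocks_run++; return 0x3000u; }

/* Hypothetical translation table: 0x3000 has not been translated yet. */
static const TTEntry demo_tt[] = {
    { 0x1000u, bb_at_1000 },
    { 0x2000u, bb_at_2000 },
};

/* Slow-path search of the whole table; NULL means "no translation". */
static Translation search_transtab(uint32_t orig)
{
    for (size_t i = 0; i < sizeof demo_tt / sizeof demo_tt[0]; i++)
        if (demo_tt[i].orig == orig)
            return demo_tt[i].trans;
    return NULL;
}

/* Run translations until one asks for an address that has none, then
   return that address so the caller can arrange a new translation. */
static uint32_t dispatch(uint32_t orig)
{
    for (;;) {
        Translation t = search_transtab(orig);
        if (t == NULL)
            return orig;
        orig = t();
    }
}
]]></programlisting>

<para>Starting this model at <computeroutput>0x1000</computeroutput>
runs two blocks and then hands <computeroutput>0x3000</computeroutput>
back to the caller, which plays the role of the top-level C
dispatcher.</para>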
153
154<para>The translator, orchestrated by
155<computeroutput>VG_(translate)</computeroutput>, is complicated
156but entirely self-contained. It is described in great detail in
157subsequent sections. Translations are stored in TC, with TT
158tracking administrative information. The translations are
159subject to an approximate LRU-based management scheme. With the
160current settings, the TC can hold at most about 15MB of
161translations, and LRU passes prune it to about 13.5MB. Given
162that the orig-to-translation expansion ratio is about 13:1 to
16314:1, this means TC holds translations for more or less a
164megabyte of original code, which generally comes to about 70000
165basic blocks for C++ compiled with optimisation on. Generating
166new translations is expensive, so it is worth having a large TC
167to minimise the (capacity) miss rate.</para>
168
169<para>The dispatcher,
170<computeroutput>VG_(dispatch)</computeroutput>, receives hints
171from the translations which allow it to cheaply spot all control
172transfers corresponding to x86
173<computeroutput>call</computeroutput> and
174<computeroutput>ret</computeroutput> instructions. It has to do
175this in order to spot some special events:</para>
176
177<itemizedlist>
178 <listitem>
179 <para>Calls to
180 <computeroutput>VG_(shutdown)</computeroutput>. This is
181 Valgrind's cue to exit. NOTE: actually this is done a
182 different way; it should be cleaned up.</para>
183 </listitem>
184
185 <listitem>
    <para>Returns of signal handlers, to the return address
187 <computeroutput>VG_(signalreturn_bogusRA)</computeroutput>.
188 The signal simulator needs to know when a signal handler is
189 returning, so we spot jumps (returns) to this address.</para>
190 </listitem>
191
192 <listitem>
193 <para>Calls to <computeroutput>vg_trap_here</computeroutput>.
194 All <computeroutput>malloc</computeroutput>,
195 <computeroutput>free</computeroutput>, etc calls that the
196 client program makes are eventually routed to a call to
197 <computeroutput>vg_trap_here</computeroutput>, and Valgrind
198 does its own special thing with these calls. In effect this
199 provides a trapdoor, by which Valgrind can intercept certain
200 calls on the simulated CPU, run the call as it sees fit
201 itself (on the real CPU), and return the result to the
202 simulated CPU, quite transparently to the client
203 program.</para>
204 </listitem>
205
206</itemizedlist>
207
208<para>Valgrind intercepts the client's
209<computeroutput>malloc</computeroutput>,
210<computeroutput>free</computeroutput>, etc, calls, so that it can
211store additional information. Each block
212<computeroutput>malloc</computeroutput>'d by the client gives
213rise to a shadow block in which Valgrind stores the call stack at
214the time of the <computeroutput>malloc</computeroutput> call.
215When the client calls <computeroutput>free</computeroutput>,
216Valgrind tries to find the shadow block corresponding to the
217address passed to <computeroutput>free</computeroutput>, and
218emits an error message if none can be found. If it is found, the
219block is placed on the freed blocks queue
220<computeroutput>vg_freed_list</computeroutput>, it is marked as
221inaccessible, and its shadow block now records the call stack at
222the time of the <computeroutput>free</computeroutput> call.
223Keeping <computeroutput>free</computeroutput>'d blocks in this
224queue allows Valgrind to spot all (presumably invalid) accesses
225to them. However, once the volume of blocks in the free queue
226exceeds <computeroutput>VG_(clo_freelist_vol)</computeroutput>,
227blocks are finally removed from the queue.</para>
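<para>The volume-capped queue of freed blocks can be sketched as
below. This is a minimal illustration, not the code in Valgrind: the
struct layouts and <computeroutput>freed_queue_add</computeroutput> are
hypothetical, the shadow-block call stacks are left out, and releasing
the oldest block first is an assumption about the eviction
order.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct FreedBlock {
    struct FreedBlock *next;
    size_t             size;   /* bytes in the client block */
} FreedBlock;

typedef struct {
    FreedBlock *head;          /* oldest freed block */
    FreedBlock *tail;          /* newest freed block */
    size_t      volume;        /* total bytes queued */
    size_t      max_volume;    /* cap, in the role of the freelist volume option */
} FreedQueue;

/* Queue a newly freed block, then release the oldest blocks until the
   queued volume is back under the cap.  Returns how many blocks were
   finally released. */
static int freed_queue_add(FreedQueue *q, size_t size)
{
    FreedBlock *b = (FreedBlock *)malloc(sizeof *b);
    int released = 0;

    assert(b != NULL);
    b->next = NULL;
    b->size = size;
    if (q->tail != NULL)
        q->tail->next = b;
    else
        q->head = b;
    q->tail = b;
    q->volume += size;

    while (q->volume > q->max_volume && q->head != NULL) {
        FreedBlock *old = q->head;
        q->head = old->next;
        if (q->head == NULL)
            q->tail = NULL;
        q->volume -= old->size;
        free(old);
        released++;
    }
    return released;
}
]]></programlisting>

<para>A larger cap keeps freed blocks inaccessible for longer, so more
stale accesses are caught, at the cost of holding on to more
memory.</para>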
228
229<para>Keeping track of <literal>A</literal> and
230<literal>V</literal> bits (note: if you don't know what these
231are, you haven't read the user guide carefully enough) for memory
232is done in <filename>vg_memory.c</filename>. This implements a
233sparse array structure which covers the entire 4G address space
234in a way which is reasonably fast and reasonably space efficient.
The 4G address space is divided up into 64K sections, each
covering 64KB of address space. Given a 32-bit address, the top
16 bits are used to select one of the 65536 entries in
<computeroutput>VG_(primary_map)</computeroutput>. The resulting
"secondary" (<computeroutput>SecMap</computeroutput>) holds A and
V bits for the 64KB chunk of address space it covers, indexed by
the lower 16 bits of the address.</para>
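<para>A two-level map of this shape can be sketched as follows. The
sketch is an illustration only: the
<computeroutput>SecMap</computeroutput> field layout, the lazy
allocation of secondaries and the helper names are assumptions, not
the contents of <filename>vg_memory.c</filename>. Only the A bits are
exercised; the V bits would be indexed in the same way.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SECMAP_BYTES 65536u   /* each secondary covers 64KB */

typedef struct {
    uint8_t abits[SECMAP_BYTES / 8];  /* 1 A bit per client byte  */
    uint8_t vbyte[SECMAP_BYTES];      /* 8 V bits per client byte */
} SecMap;

/* One entry per 64KB chunk of the 4G address space. */
static SecMap *primary_map[65536];

/* Top 16 bits select the secondary; allocate it on first use
   (lazy allocation is an assumption of this sketch). */
static SecMap *get_secmap(uint32_t addr)
{
    uint32_t hi = addr >> 16;
    if (primary_map[hi] == NULL) {
        primary_map[hi] = (SecMap *)calloc(1, sizeof(SecMap));
        assert(primary_map[hi] != NULL);
    }
    return primary_map[hi];
}

/* Low 16 bits index within the secondary. */
static void set_abit(uint32_t addr, int accessible)
{
    SecMap  *sm  = get_secmap(addr);
    uint32_t lo  = addr & 0xFFFFu;
    uint8_t  bit = (uint8_t)(1u << (lo & 7u));
    if (accessible)
        sm->abits[lo >> 3] |= bit;
    else
        sm->abits[lo >> 3] &= (uint8_t)~bit;
}

/* Addresses in chunks that were never touched read as inaccessible. */
static int get_abit(uint32_t addr)
{
    const SecMap *sm = primary_map[addr >> 16];
    uint32_t      lo = addr & 0xFFFFu;
    if (sm == NULL)
        return 0;
    return (sm->abits[lo >> 3] >> (lo & 7u)) & 1;
}
]]></programlisting>

<para>The appeal of the design is that a lookup is two array indexings
and a shift, while chunks of address space that the client never
touches cost only one pointer each in the primary map.</para>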
242
243</sect2>
244
245
246
247<sect2 id="mc-tech-docs.design" xreflabel="Design decisions">
248<title>Design decisions</title>
249
250<para>Some design decisions were motivated by the need to make
251Valgrind debuggable. Imagine you are writing a CPU simulator.
252It works fairly well. However, you run some large program, like
253Netscape, and after tens of millions of instructions, it crashes.
254How can you figure out where in your simulator the bug is?</para>
255
256<para>Valgrind's answer is: cheat. Valgrind is designed so that
257it is possible to switch back to running the client program on
258the real CPU at any point. Using the
<computeroutput>--stop-after=</computeroutput> flag, you can ask
260Valgrind to run just some number of basic blocks, and then run
261the rest of the way on the real CPU. If you are searching for a
262bug in the simulated CPU, you can use this to do a binary search,
263which quickly leads you to the specific basic block which is
264causing the problem.</para>
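<para>The binary search itself is simple enough to sketch. In the
code below, <computeroutput>fails_with</computeroutput> stands for "run
the client with the first <emphasis>n</emphasis> basic blocks on the
simulated CPU and the rest on the real CPU, and report whether it
still misbehaves". That callback, and the stand-in
<computeroutput>demo_fails_with</computeroutput>, are hypothetical; in
practice each probe would be a fresh run of Valgrind with a different
basic-block count. The sketch also assumes the failure is monotonic in
<emphasis>n</emphasis>.</para>

<programlisting><![CDATA[
#include <assert.h>

/* Nonzero if the run still fails when n basic blocks are simulated. */
typedef int (*fails_with_fn)(unsigned long n);

/* Precondition: fails_with(lo) == 0 and fails_with(hi) != 0.
   Returns the smallest n for which the run fails, i.e. the index of
   the basic block whose simulation introduces the problem. */
static unsigned long find_bad_block(fails_with_fn fails_with,
                                    unsigned long lo, unsigned long hi)
{
    while (hi - lo > 1) {
        unsigned long mid = lo + (hi - lo) / 2;
        if (fails_with(mid))
            hi = mid;   /* culprit is at or before mid */
        else
            lo = mid;   /* first mid blocks are fine   */
    }
    return hi;
}

/* Stand-in for real runs: pretend block 123457 is mistranslated. */
static int demo_fails_with(unsigned long n)
{
    return n >= 123457UL;
}
]]></programlisting>

<para>Even for a run of tens of millions of basic blocks this needs
only a few dozen probes, which is what makes the approach
practical.</para>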
265
266<para>This is all very handy. It does constrain the design in
267certain unimportant ways. Firstly, the layout of memory, when
268viewed from the client's point of view, must be identical
269regardless of whether it is running on the real or simulated CPU.
270This means that Valgrind can't do pointer swizzling -- well, no
271great loss -- and it can't run on the same stack as the client --
272again, no great loss. Valgrind operates on its own stack,
273<computeroutput>VG_(stack)</computeroutput>, which it switches to
274at startup, temporarily switching back to the client's stack when
275doing system calls for the client.</para>
276
277<para>Valgrind also receives signals on its own stack,
278<computeroutput>VG_(sigstack)</computeroutput>, but for different
279gruesome reasons discussed below.</para>
280
281<para>This nice clean
282switch-back-to-the-real-CPU-whenever-you-like story is muddied by
signals. The problem is that signals arrive at arbitrary times and
284tend to slightly perturb the basic block count, with the result
285that you can get close to the basic block causing a problem but
286can't home in on it exactly. My kludgey hack is to define
287<computeroutput>SIGNAL_SIMULATION</computeroutput> to 1 towards
288the bottom of <filename>vg_syscall_mem.c</filename>, so that
289signal handlers are run on the real CPU and don't change the BB
290counts.</para>
291
292<para>A second hole in the switch-back-to-real-CPU story is that
293Valgrind's way of delivering signals to the client is different
294from that of the kernel. Specifically, the layout of the signal
295delivery frame, and the mechanism used to detect a sighandler
296returning, are different. So you can't expect to make the
297transition inside a sighandler and still have things working, but
298in practice that's not much of a restriction.</para>
299
300<para>Valgrind's implementation of
301<computeroutput>malloc</computeroutput>,
302<computeroutput>free</computeroutput>, etc, (in
303<filename>vg_clientmalloc.c</filename>, not the low-level stuff
304in <filename>vg_malloc2.c</filename>) is somewhat complicated by
the need to handle switching back at arbitrary points. It does
work, though.</para>
307
308</sect2>
309
310
311
312<sect2 id="mc-tech-docs.correctness" xreflabel="Correctness">
313<title>Correctness</title>
314
315<para>There's only one of me, and I have a Real Life (tm) as well
316as hacking Valgrind [allegedly :-]. That means I don't have time
317to waste chasing endless bugs in Valgrind. My emphasis is
318therefore on doing everything as simply as possible, with
319correctness, stability and robustness being the number one
320priority, more important than performance or functionality. As a
321result:</para>
322
323<itemizedlist>
324
325 <listitem>
326 <para>The code is absolutely loaded with assertions, and
327 these are <command>permanently enabled.</command> I have no
328 plan to remove or disable them later. Over the past couple
329 of months, as valgrind has become more widely used, they have
330 shown their worth, pulling up various bugs which would
331 otherwise have appeared as hard-to-find segmentation
332 faults.</para>
333
334 <para>I am of the view that it's acceptable to spend 5% of
335 the total running time of your valgrindified program doing
336 assertion checks and other internal sanity checks.</para>
337 </listitem>
338
339 <listitem>
340 <para>Aside from the assertions, valgrind contains various
341 sets of internal sanity checks, which get run at varying
342 frequencies during normal operation.
343 <computeroutput>VG_(do_sanity_checks)</computeroutput> runs
344 every 1000 basic blocks, which means 500 to 2000 times/second
345 for typical machines at present. It checks that Valgrind
346 hasn't overrun its private stack, and does some simple checks
347 on the memory permissions maps. Once every 25 calls it does
348 some more extensive checks on those maps. Etc, etc.</para>
349 <para>The following components also have sanity check code,
350 which can be enabled to aid debugging:</para>
351 <itemizedlist>
352 <listitem><para>The low-level memory-manager
353 (<computeroutput>VG_(mallocSanityCheckArena)</computeroutput>).
354 This does a complete check of all blocks and chains in an
        arena, which is very slow. It is not engaged by default.</para>
356 </listitem>
357
358 <listitem>
359 <para>The symbol table reader(s): various checks to
360 ensure uniqueness of mappings; see
361 <computeroutput>VG_(read_symbols)</computeroutput> for a
        start. These checks are permanently engaged.</para>
363 </listitem>
364
365 <listitem>
366 <para>The A and V bit tracking stuff in
367 <filename>vg_memory.c</filename>. This can be compiled
368 with cpp symbol
369 <computeroutput>VG_DEBUG_MEMORY</computeroutput> defined,
370 which removes all the fast, optimised cases, and uses
371 simple-but-slow fallbacks instead. Not engaged by
372 default.</para>
373 </listitem>
374
375 <listitem>
376 <para>Ditto
377 <computeroutput>VG_DEBUG_LEAKCHECK</computeroutput>.</para>
378 </listitem>
379
380 <listitem>
381 <para>The JITter parses x86 basic blocks into sequences
382 of UCode instructions. It then sanity checks each one
383 with <computeroutput>VG_(saneUInstr)</computeroutput> and
384 sanity checks the sequence as a whole with
385 <computeroutput>VG_(saneUCodeBlock)</computeroutput>.
386 This stuff is engaged by default, and has caught some
387 way-obscure bugs in the simulated CPU machinery in its
388 time.</para>
389 </listitem>
390
391 <listitem>
392 <para>The system call wrapper does
393 <computeroutput>VG_(first_and_last_secondaries_look_plausible)</computeroutput>
394 after every syscall; this is known to pick up bugs in the
395 syscall wrappers. Engaged by default.</para>
396 </listitem>
397
398 <listitem>
399 <para>The main dispatch loop, in
400 <computeroutput>VG_(dispatch)</computeroutput>, checks
401 that translations do not set
402 <computeroutput>%ebp</computeroutput> to any value
403 different from
404 <computeroutput>VG_EBP_DISPATCH_CHECKED</computeroutput>
405 or <computeroutput>&amp; VG_(baseBlock)</computeroutput>.
406 In effect this test is free, and is permanently
407 engaged.</para>
408 </listitem>
409
410 <listitem>
411 <para>There are a couple of ifdefed-out consistency
412 checks I inserted whilst debugging the new register
        allocator,
414 <computeroutput>vg_do_register_allocation</computeroutput>.</para>
415 </listitem>
416 </itemizedlist>
417 </listitem>
418
419 <listitem>
420 <para>I try to avoid techniques, algorithms, mechanisms, etc,
421 for which I can supply neither a convincing argument that
422 they are correct, nor sanity-check code which might pick up
423 bugs in my implementation. I don't always succeed in this,
424 but I try. Basically the idea is: avoid techniques which
425 are, in practice, unverifiable, in some sense. When doing
426 anything, always have in mind: "how can I verify that this is
427 correct?"</para>
428 </listitem>
429
430</itemizedlist>
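<para>The two-tier checking frequency described in the list above
(cheap checks every 1000 basic blocks, more extensive checks on every
25th call) can be sketched with a pair of counters. The struct and
function names are hypothetical, and the checks themselves are
replaced by counters so that the schedule can be observed.</para>

<programlisting><![CDATA[
#include <assert.h>

#define BBS_PER_CHECK        1000u  /* basic blocks between cheap checks  */
#define CHECKS_PER_EXTENSIVE 25u    /* cheap checks per extensive check   */

typedef struct {
    unsigned long bbs_run;
    unsigned long cheap_checks;
    unsigned long extensive_checks;
} SanityStats;

/* Call once per executed basic block. */
static void note_bb_executed(SanityStats *s)
{
    s->bbs_run++;
    if (s->bbs_run % BBS_PER_CHECK != 0)
        return;

    /* Stack-overrun and simple permission-map checks would go here. */
    s->cheap_checks++;

    if (s->cheap_checks % CHECKS_PER_EXTENSIVE == 0) {
        /* The more extensive map checks would go here. */
        s->extensive_checks++;
    }
}
]]></programlisting>

<para>With these constants the extensive checks run once per 25000
basic blocks, which keeps their cost a small fraction of the total
even though each one is comparatively slow.</para>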
431
432
433<para>Some more specific things are:</para>
434<itemizedlist>
435 <listitem>
436 <para>Valgrind runs in the same namespace as the client, at
437 least from <filename>ld.so</filename>'s point of view, and it
438 therefore absolutely had better not export any symbol with a
439 name which could clash with that of the client or any of its
440 libraries. Therefore, all globally visible symbols exported
441 from <filename>valgrind.so</filename> are defined using the
442 <computeroutput>VG_</computeroutput> CPP macro. As you'll
    see from <filename>vg_constants.h</filename>, this prepends
444 some arbitrary prefix to the symbol, in order that it be, we
445 hope, globally unique. Currently the prefix is
446 <computeroutput>vgPlain_</computeroutput>. For convenience
447 there are also <computeroutput>VGM_</computeroutput>,
448 <computeroutput>VGP_</computeroutput> and
449 <computeroutput>VGOFF_</computeroutput>. All locally defined
450 symbols are declared <computeroutput>static</computeroutput>
451 and do not appear in the final shared object.</para>
452
453 <para>To check this, I periodically do <computeroutput>nm
454 valgrind.so | grep " T "</computeroutput>, which shows you
455 all the globally exported text symbols. They should all have
456 an approved prefix, except for those like
457 <computeroutput>malloc</computeroutput>,
458 <computeroutput>free</computeroutput>, etc, which we
459 deliberately want to shadow and take precedence over the same
460 names exported from <filename>glibc.so</filename>, so that
461 valgrind can intercept those calls easily. Similarly,
462 <computeroutput>nm valgrind.so | grep " D "</computeroutput>
463 allows you to find any rogue data-segment symbol
464 names.</para>
465 </listitem>
466
467 <listitem>
468 <para>Valgrind tries, and almost succeeds, in being
469 completely independent of all other shared objects, in
470 particular of <filename>glibc.so</filename>. For example, we
471 have our own low-level memory manager in
472 <filename>vg_malloc2.c</filename>, which is a fairly standard
473 malloc/free scheme augmented with arenas, and
474 <filename>vg_mylibc.c</filename> exports reimplementations of
475 various bits and pieces you'd normally get from the C
476 library.</para>
477
478 <para>Why all the hassle? Because imagine the potential
479 chaos of both the simulated and real CPUs executing in
480 <filename>glibc.so</filename>. It just seems simpler and
481 cleaner to be completely self-contained, so that only the
482 simulated CPU visits <filename>glibc.so</filename>. In
483 practice it's not much hassle anyway. Also, valgrind starts
484 up before glibc has a chance to initialise itself, and who
485 knows what difficulties that could lead to. Finally, glibc
486 has definitions for some types, specifically
    <computeroutput>sigset_t</computeroutput>, which conflict with
    (are different from) the Linux kernel's idea of the same. When
489 Valgrind wants to fiddle around with signal stuff, it wants
490 to use the kernel's definitions, not glibc's definitions. So
491 it's simplest just to keep glibc out of the picture
492 entirely.</para>
493
494 <para>To find out which glibc symbols are used by Valgrind,
495 reinstate the link flags <computeroutput>-nostdlib
496 -Wl,-no-undefined</computeroutput>. This causes linking to
497 fail, but will tell you what you depend on. I have mostly,
498 but not entirely, got rid of the glibc dependencies; what
499 remains is, IMO, fairly harmless. AFAIK the current
500 dependencies are: <computeroutput>memset</computeroutput>,
501 <computeroutput>memcmp</computeroutput>,
502 <computeroutput>stat</computeroutput>,
503 <computeroutput>system</computeroutput>,
504 <computeroutput>sbrk</computeroutput>,
505 <computeroutput>setjmp</computeroutput> and
506 <computeroutput>longjmp</computeroutput>.</para>
507 </listitem>
508
509 <listitem>
510 <para>Similarly, valgrind should not really import any
511 headers other than the Linux kernel headers, since it knows
512 of no API other than the kernel interface to talk to. At the
513 moment this is really not in a good state, and
514 <computeroutput>vg_syscall_mem</computeroutput> imports, via
515 <filename>vg_unsafe.h</filename>, a significant number of
516 C-library headers so as to know the sizes of various structs
517 passed across the kernel boundary. This is of course
518 completely bogus, since there is no guarantee that the C
    library's definitions of these structs match those of the
520 kernel. I have started to sort this out using
521 <filename>vg_kerneliface.h</filename>, into which I had
522 intended to copy all kernel definitions which valgrind could
523 need, but this has not gotten very far. At the moment it
524 mostly contains definitions for
525 <computeroutput>sigset_t</computeroutput> and
526 <computeroutput>struct sigaction</computeroutput>, since the
527 kernel's definition for these really does clash with glibc's.
528 I plan to use a <computeroutput>vki_</computeroutput> prefix
529 on all these types and constants, to denote the fact that
530 they pertain to <command>V</command>algrind's
531 <command>K</command>ernel
532 <command>I</command>nterface.</para>
533
534 <para>Another advantage of having a
535 <filename>vg_kerneliface.h</filename> file is that it makes
    it simpler to interface to a different kernel. One can, for
537 example, easily imagine writing a new
538 <filename>vg_kerneliface.h</filename> for FreeBSD, or x86
539 NetBSD.</para>
540 </listitem>
541
542</itemizedlist>
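<para>The symbol-prefixing scheme from the first item in the list above
can be illustrated with token pasting. The macro definitions here are a
plausible reconstruction rather than a copy of
<filename>vg_constants.h</filename>, and
<computeroutput>next_id</computeroutput> is a hypothetical symbol
invented for the example; only the
<computeroutput>vgPlain_</computeroutput> prefix is taken from the
text.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <string.h>

/* Paste a fixed prefix onto every exported name. */
#define VGAPPEND(str1, str2) str1##str2
#define VG_(str)             VGAPPEND(vgPlain_, str)

/* Helpers to view the expanded name as a string. */
#define STR_(x) #x
#define STR(x)  STR_(x)

/* static: stays local to this file and is never exported. */
static int counter = 0;

/* Exported, but under the name vgPlain_next_id, which is unlikely to
   clash with anything in the client or its libraries. */
int VG_(next_id)(void)
{
    return ++counter;
}
]]></programlisting>

<para>After compilation, <computeroutput>nm</computeroutput> would list
<computeroutput>vgPlain_next_id</computeroutput> as a text symbol and
would not list <computeroutput>counter</computeroutput> as a global at
all, which is the property the periodic
<computeroutput>nm</computeroutput> check relies on.</para>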
543
544</sect2>
545
546
547
548<sect2 id="mc-tech-docs.limits" xreflabel="Current limitations">
549<title>Current limitations</title>
550
551<para>Support for weird (non-POSIX) signal stuff is patchy. Does
552anybody care?</para>
553
554</sect2>
555
556</sect1>
557
558
559
560
561
562<sect1 id="mc-tech-docs.jitter" xreflabel="The instrumenting JITter">
563<title>The instrumenting JITter</title>
564
565<para>This really is the heart of the matter. We begin with
566various side issues.</para>
567
568
569<sect2 id="mc-tech-docs.storage"
570 xreflabel="Run-time storage, and the use of host registers">
571<title>Run-time storage, and the use of host registers</title>
572
573<para>Valgrind translates client (original) basic blocks into
574instrumented basic blocks, which live in the translation cache
575TC, until either the client finishes or the translations are
576ejected from TC to make room for newer ones.</para>
577
578<para>Since it generates x86 code in memory, Valgrind has
579complete control of the use of registers in the translations.
580Now pay attention. I shall say this only once, and it is
581important you understand this. In what follows I will refer to
582registers in the host (real) cpu using their standard names,
583<computeroutput>%eax</computeroutput>,
584<computeroutput>%edi</computeroutput>, etc. I refer to registers
585in the simulated CPU by capitalising them:
586<computeroutput>%EAX</computeroutput>,
587<computeroutput>%EDI</computeroutput>, etc. These two sets of
588registers usually bear no direct relationship to each other;
589there is no fixed mapping between them. This naming scheme is
590used fairly consistently in the comments in the sources.</para>
591
592<para>Host registers, once things are up and running, are used as
593follows:</para>
594
595<itemizedlist>
596 <listitem>
597 <para><computeroutput>%esp</computeroutput>, the real stack
598 pointer, points somewhere in Valgrind's private stack area,
599 <computeroutput>VG_(stack)</computeroutput> or, transiently,
600 into its signal delivery stack,
601 <computeroutput>VG_(sigstack)</computeroutput>.</para>
602 </listitem>
603
604 <listitem>
605 <para><computeroutput>%edi</computeroutput> is used as a
606 temporary in code generation; it is almost always dead,
607 except when used for the
608 <computeroutput>Left</computeroutput> value-tag operations.</para>
609 </listitem>
610
611 <listitem>
612 <para><computeroutput>%eax</computeroutput>,
613 <computeroutput>%ebx</computeroutput>,
614 <computeroutput>%ecx</computeroutput>,
615 <computeroutput>%edx</computeroutput> and
616 <computeroutput>%esi</computeroutput> are available to
617 Valgrind's register allocator. They are dead (carry
618 unimportant values) in between translations, and are live
619 only in translations. The one exception to this is
620 <computeroutput>%eax</computeroutput>, which, as mentioned
621 far above, has a special significance to the dispatch loop
622 <computeroutput>VG_(dispatch)</computeroutput>: when a
623 translation returns to the dispatch loop,
624 <computeroutput>%eax</computeroutput> is expected to contain
625 the original-code-address of the next translation to run.
626 The register allocator is so good at minimising spill code
627 that using five regs and not having to save/restore
628 <computeroutput>%edi</computeroutput> actually gives better
629 code than allocating to <computeroutput>%edi</computeroutput>
630 as well, but then having to push/pop it around special
631 uses.</para>
632 </listitem>
633
634 <listitem>
635 <para><computeroutput>%ebp</computeroutput> points
636 permanently at
637 <computeroutput>VG_(baseBlock)</computeroutput>. Valgrind's
638 translations are position-independent, partly because this is
639 convenient, but also because translations get moved around in
640 TC as part of the LRUing activity. <command>All</command>
641 static entities which need to be referred to from generated
642 code, whether data or helper functions, are stored starting
643 at <computeroutput>VG_(baseBlock)</computeroutput> and are
644 therefore reached by indexing from
645 <computeroutput>%ebp</computeroutput>. There is but one
646 exception, which is that by placing the value
647 <computeroutput>VG_EBP_DISPATCH_CHECKED</computeroutput> in
648 <computeroutput>%ebp</computeroutput> just before a return to
649 the dispatcher, the dispatcher is informed that the next
650 address to run, in <computeroutput>%eax</computeroutput>,
651 requires special treatment.</para>
652 </listitem>
653
654 <listitem>
655 <para>The real machine's FPU state is pretty much
656 unimportant, for reasons which will become obvious. Ditto
657 its <computeroutput>%eflags</computeroutput> register.</para>
658 </listitem>
659
660</itemizedlist>
661
662<para>The state of the simulated CPU is stored in memory, in
663<computeroutput>VG_(baseBlock)</computeroutput>, which is a block
664of 200 words IIRC. Recall that
665<computeroutput>%ebp</computeroutput> points permanently at the
666start of this block. Function
667<computeroutput>vg_init_baseBlock</computeroutput> decides what
668the offsets of various entities in
669<computeroutput>VG_(baseBlock)</computeroutput> are to be, and
670allocates word offsets for them. The code generator then emits
671<computeroutput>%ebp</computeroutput> relative addresses to get
672at those things. The sequence in which entities are allocated
673has been carefully chosen so that the 32 most popular entities
674come first, because this means 8-bit offsets can be used in the
675generated code.</para>
676
677<para>If I was clever, I could make
678<computeroutput>%ebp</computeroutput> point 32 words along
679<computeroutput>VG_(baseBlock)</computeroutput>, so that I'd have
680another 32 words of short-form offsets available, but that's just
681complicated, and it's not important -- the first 32 words take
68299% (or whatever) of the traffic.</para>
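<para>The arithmetic behind the 32-word boundary can be checked with a
small sketch. An x86 base-plus-displacement operand can use a one-byte
displacement only when the byte offset fits in a signed 8-bit value;
with 4-byte words and non-negative offsets that covers words 0 to 31.
The function names and the <computeroutput>bias_words</computeroutput>
parameter are hypothetical, introduced only to model where
<computeroutput>%ebp</computeroutput> points.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>

#define WORD_BYTES 4

/* True if the byte offset fits a signed 8-bit displacement. */
static int fits_disp8(int32_t byte_off)
{
    return byte_off >= -128 && byte_off <= 127;
}

/* Byte offset of a word from %ebp, when %ebp points bias_words words
   past the start of the block (0 in the scheme described above). */
static int32_t ebp_byte_offset(int word_index, int bias_words)
{
    return (int32_t)(word_index - bias_words) * WORD_BYTES;
}

/* How many words of the block are reachable with short-form offsets. */
static int count_short_form(int total_words, int bias_words)
{
    int n = 0;
    for (int w = 0; w < total_words; w++)
        if (fits_disp8(ebp_byte_offset(w, bias_words)))
            n++;
    return n;
}
]]></programlisting>

<para>For a 200-word block, a bias of 0 gives 32 short-form words and a
bias of 32 gives 64, since the negative half of the displacement range
then becomes usable as well.</para>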
683
684<para>Currently, the sequence of stuff in
685<computeroutput>VG_(baseBlock)</computeroutput> is as
686follows:</para>
687
688<itemizedlist>
689 <listitem>
690 <para>9 words, holding the simulated integer registers,
691 <computeroutput>%EAX</computeroutput>
692 .. <computeroutput>%EDI</computeroutput>, and the simulated
693 flags, <computeroutput>%EFLAGS</computeroutput>.</para>
694 </listitem>
695
696 <listitem>
697 <para>Another 9 words, holding the V bit "shadows" for the
698 above 9 regs.</para>
699 </listitem>
700
701 <listitem>
702 <para>The <command>addresses</command> of various helper
703 routines called from generated code:
704 <computeroutput>VG_(helper_value_check4_fail)</computeroutput>,
705 <computeroutput>VG_(helper_value_check0_fail)</computeroutput>,
706 which register V-check failures,
707 <computeroutput>VG_(helperc_STOREV4)</computeroutput>,
708 <computeroutput>VG_(helperc_STOREV1)</computeroutput>,
709 <computeroutput>VG_(helperc_LOADV4)</computeroutput>,
710 <computeroutput>VG_(helperc_LOADV1)</computeroutput>, which
711 do stores and loads of V bits to/from the sparse array which
712 keeps track of V bits in memory, and
    <computeroutput>VGM_(handle_esp_assignment)</computeroutput>,
    which messes with memory addressability resulting from
    changes in <computeroutput>%ESP</computeroutput>.</para>
716 </listitem>
717
718 <listitem>
719 <para>The simulated <computeroutput>%EIP</computeroutput>.</para>
720 </listitem>
721
722 <listitem>
723 <para>24 spill words, for when the register allocator can't
724 make it work with 5 measly registers.</para>
725 </listitem>
726
727 <listitem>
728 <para>Addresses of helpers
729 <computeroutput>VG_(helperc_STOREV2)</computeroutput>,
730 <computeroutput>VG_(helperc_LOADV2)</computeroutput>. These
731 are here because 2-byte loads and stores are relatively rare,
732 so are placed above the magic 32-word offset boundary.</para>
733 </listitem>
734
735 <listitem>
736 <para>For similar reasons, addresses of helper functions
737 <computeroutput>VGM_(fpu_write_check)</computeroutput> and
738 <computeroutput>VGM_(fpu_read_check)</computeroutput>, which
739 handle the A/V maps testing and changes required by FPU
740 writes/reads.</para>
741 </listitem>
742
743 <listitem>
744 <para>Some other boring helper addresses:
745 <computeroutput>VG_(helper_value_check2_fail)</computeroutput>
746 and
747 <computeroutput>VG_(helper_value_check1_fail)</computeroutput>.
748 These are probably never emitted now, and should be
749 removed.</para>
750 </listitem>
751
752 <listitem>
753 <para>The entire state of the simulated FPU, which I believe
754 to be 108 bytes long.</para>
755 </listitem>
756
757 <listitem>
758 <para>Finally, the addresses of various other helper
759 functions in <filename>vg_helpers.S</filename>, which deal
760 with rare situations which are tedious or difficult to
761 generate code in-line for.</para>
762 </listitem>
763
764</itemizedlist>
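
<para>The way word offsets fall out of that ordering can be
sketched with a simple bump allocator.  This is a guess at the
general shape, not the real
<computeroutput>vg_init_baseBlock</computeroutput>: every name
below is hypothetical, and the count of 7 helper addresses is
simply the number of helpers named in the third item of the
list.</para>
<programlisting><![CDATA[
#include <assert.h>

#define BASEBLOCK_WORDS 200      /* size quoted in the text ("IIRC") */

static int next_free_word = 0;   /* next unallocated word offset */

/* Hand out `n` consecutive words; return the offset of the first. */
static int alloc_words(int n)
{
   int off = next_free_word;
   assert(n > 0 && off + n <= BASEBLOCK_WORDS);
   next_free_word += n;
   return off;
}

/* Offsets of the first few groups, in the order listed above.
   All of these names are hypothetical. */
static int off_int_regs, off_shadow_regs, off_helpers, off_eip, off_spills;

static void init_layout_sketch(void)
{
   next_free_word  = 0;
   off_int_regs    = alloc_words(9);   /* %EAX .. %EDI, %EFLAGS    */
   off_shadow_regs = alloc_words(9);   /* V-bit shadows of those   */
   off_helpers     = alloc_words(7);   /* helper routine addresses */
   off_eip         = alloc_words(1);   /* simulated %EIP           */
   off_spills      = alloc_words(24);  /* spill slots              */
}]]></programlisting>

<para>Taking the list literally like this puts the first spill
slot at word 26, so only the first few spill slots would fall
inside the short-offset range.  The real layout may well
differ.</para>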
765
766<para>As a general rule, the simulated machine's state lives
767permanently in memory at
768<computeroutput>VG_(baseBlock)</computeroutput>. However, the
769JITter does some optimisations which allow the simulated integer
770registers to be cached in real registers over multiple simulated
771instructions within the same basic block. These are always
772flushed back into memory at the end of every basic block, so that
773the in-memory state is up-to-date between basic blocks. (This
774flushing is implied by the statement above that the real
775machine's allocatable registers are dead in between simulated
776blocks).</para>
777
778</sect2>
779
780
781
782<sect2 id="mc-tech-docs.startup"
783 xreflabel="Startup, shutdown, and system calls">
784<title>Startup, shutdown, and system calls</title>
785
<para>Getting into Valgrind
(<computeroutput>VG_(startup)</computeroutput>, called from
<filename>valgrind.so</filename>'s initialisation section),
really means copying the real CPU's state into
<computeroutput>VG_(baseBlock)</computeroutput>, and then
installing our own stack pointer, etc, into the real CPU, and
then starting up the JITter.  Exiting Valgrind involves copying
the simulated state back to the real state.</para>
794
795<para>Unfortunately, there's a complication at startup time.
796Problem is that at the point where we need to take a snapshot of
797the real CPU's state, the offsets in
798<computeroutput>VG_(baseBlock)</computeroutput> are not set up
799yet, because to do so would involve disrupting the real machine's
800state significantly. The way round this is to dump the real
801machine's state into a temporary, static block of memory,
802<computeroutput>VG_(m_state_static)</computeroutput>. We can
803then set up the <computeroutput>VG_(baseBlock)</computeroutput>
804offsets at our leisure, and copy into it from
805<computeroutput>VG_(m_state_static)</computeroutput> at some
806convenient later time. This copying is done by
807<computeroutput>VG_(copy_m_state_static_to_baseBlock)</computeroutput>.</para>
808
809<para>On exit, the inverse transformation is (rather
810unnecessarily) used: stuff in
811<computeroutput>VG_(baseBlock)</computeroutput> is copied to
812<computeroutput>VG_(m_state_static)</computeroutput>, and the
813assembly stub then copies from
814<computeroutput>VG_(m_state_static)</computeroutput> into the
815real machine registers.</para>
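
<para>The two-phase copy can be sketched as follows.  This is a
simplified model, not the real code: the array sizes, the
lower-case names and the single
<computeroutput>off_int_regs</computeroutput> offset are
assumptions made for illustration, and the real snapshot is taken
by an assembly stub rather than a C function.</para>
<programlisting><![CDATA[
#include <assert.h>
#include <string.h>

#define N_STATE_WORDS   9    /* 8 integer registers plus flags */
#define BASEBLOCK_WORDS 200

static unsigned int m_state_static[N_STATE_WORDS]; /* temporary snapshot  */
static unsigned int base_block[BASEBLOCK_WORDS];   /* simulated CPU state */
static int off_int_regs = -1;   /* not yet known when the snapshot is taken */

/* Phase 1: record the machine state without needing any offsets. */
static void snapshot_state(const unsigned int *real_regs)
{
   memcpy(m_state_static, real_regs, sizeof m_state_static);
}

/* Phase 2: once the offsets are decided, copy the snapshot into place. */
static void copy_static_to_base_block(void)
{
   assert(off_int_regs >= 0);
   memcpy(&base_block[off_int_regs], m_state_static, sizeof m_state_static);
}

/* Exit path: the inverse copy, back into the static block. */
static void copy_base_block_to_static(void)
{
   assert(off_int_regs >= 0);
   memcpy(m_state_static, &base_block[off_int_regs], sizeof m_state_static);
}]]></programlisting>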
816
817<para>Doing system calls on behalf of the client
818(<filename>vg_syscall.S</filename>) is something of a half-way
819house. We have to make the world look sufficiently like that
820which the client would normally have to make the syscall actually
821work properly, but we can't afford to lose control. So the trick
822is to copy all of the client's state, <command>except its program
823counter</command>, into the real CPU, do the system call, and
824copy the state back out. Note that the client's state includes
825its stack pointer register, so one effect of this partial
826restoration is to cause the system call to be run on the client's
827stack, as it should be.</para>
828
829<para>As ever there are complications. We have to save some of
830our own state somewhere when restoring the client's state into
831the CPU, so that we can keep going sensibly afterwards. In fact
832the only thing which is important is our own stack pointer, but
833for paranoia reasons I save and restore our own FPU state as
834well, even though that's probably pointless.</para>
835
<para>The complication on the above complication is that, for
horrible reasons to do with signals, we may have to handle a
second client system call whilst the client is blocked inside
some other system call (unbelievable!).  That means there are two
sets of places to dump Valgrind's stack pointer and FPU state
across the syscall, and we decide which to use by consulting
<computeroutput>VG_(syscall_depth)</computeroutput>, which is in
turn maintained by
<computeroutput>VG_(wrap_syscall)</computeroutput>.</para>
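
<para>One way to picture the depth-indexed save areas is the
sketch below.  It is an assumption about the general scheme, not
a description of how
<computeroutput>VG_(wrap_syscall)</computeroutput> actually
maintains the counter; the structure and function names are
invented, and the 108-byte FPU image is the size quoted earlier
for the simulated FPU state.</para>
<programlisting><![CDATA[
#include <assert.h>

#define MAX_SYSCALL_DEPTH 2   /* "two sets of places", per the text */

struct save_area {
   unsigned int  saved_esp;        /* Valgrind's own stack pointer */
   unsigned char saved_fpu[108];   /* Valgrind's own FPU state     */
};

static struct save_area save_areas[MAX_SYSCALL_DEPTH];
static int syscall_depth = 0;      /* stands in for VG_(syscall_depth) */

/* Pick the save area for a syscall that is about to start. */
static struct save_area *enter_syscall(void)
{
   assert(syscall_depth < MAX_SYSCALL_DEPTH);
   return &save_areas[syscall_depth++];
}

/* Return the save area of the syscall that has just finished. */
static struct save_area *leave_syscall(void)
{
   assert(syscall_depth > 0);
   return &save_areas[--syscall_depth];
}]]></programlisting>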
845
846</sect2>
847
848
849
850<sect2 id="mc-tech-docs.ucode" xreflabel="Introduction to UCode">
851<title>Introduction to UCode</title>
852
<para>UCode lies at the heart of the x86-to-x86 JITter.  The
basic premise is that dealing with the x86 instruction set head-on
is just too darn complicated, so we do the traditional
compiler-writer's trick and translate it into a simpler,
easier-to-deal-with form.</para>
858
859<para>In normal operation, translation proceeds through six
860stages, coordinated by
861<computeroutput>VG_(translate)</computeroutput>:</para>
862
863<orderedlist>
864 <listitem>
865 <para>Parsing of an x86 basic block into a sequence of UCode
866 instructions (<computeroutput>VG_(disBB)</computeroutput>).</para>
867 </listitem>
868
869 <listitem>
870 <para>UCode optimisation
871 (<computeroutput>vg_improve</computeroutput>), with the aim
872 of caching simulated registers in real registers over
873 multiple simulated instructions, and removing redundant
874 simulated <computeroutput>%EFLAGS</computeroutput>
875 saving/restoring.</para>
876 </listitem>
877
878 <listitem>
879 <para>UCode instrumentation
880 (<computeroutput>vg_instrument</computeroutput>), which adds
881 value and address checking code.</para>
882 </listitem>
883
884 <listitem>
885 <para>Post-instrumentation cleanup
886 (<computeroutput>vg_cleanup</computeroutput>), removing
887 redundant value-check computations.</para>
888 </listitem>
889
890 <listitem>
891 <para>Register allocation
892 (<computeroutput>vg_do_register_allocation</computeroutput>),
893 which, note, is done on UCode.</para>
894 </listitem>
895
896 <listitem>
897 <para>Emission of final instrumented x86 code
898 (<computeroutput>VG_(emit_code)</computeroutput>).</para>
899 </listitem>
900
901</orderedlist>
902
903<para>Notice how steps 2, 3, 4 and 5 are simple UCode-to-UCode
904transformation passes, all on straight-line blocks of UCode (type
905<computeroutput>UCodeBlock</computeroutput>). Steps 2 and 4 are
906optimisation passes and can be disabled for debugging purposes,
907with <computeroutput>--optimise=no</computeroutput> and
908<computeroutput>--cleanup=no</computeroutput> respectively.</para>
909
910<para>Valgrind can also run in a no-instrumentation mode, given
911<computeroutput>--instrument=no</computeroutput>. This is useful
912for debugging the JITter quickly without having to deal with the
913complexity of the instrumentation mechanism too. In this mode,
914steps 3 and 4 are omitted.</para>
915
<para>These flags combine, so that
<computeroutput>--instrument=no</computeroutput> together with
<computeroutput>--optimise=no</computeroutput> means only steps
1, 5 and 6 are used.
<computeroutput>--single-step=yes</computeroutput> causes each
x86 instruction to be treated as a single basic block.  The
translations are terrible but this is sometimes instructive.</para>
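
<para>How the three flags select stages can be summarised in a
few lines of C.  This is a sketch of the rules as stated above,
not the logic in
<computeroutput>VG_(translate)</computeroutput>; the enum and
function names are made up.</para>
<programlisting><![CDATA[
#include <stdbool.h>

/* Bit i-1 set means stage i of the six-stage list runs. */
enum {
   STAGE_PARSE      = 1 << 0,   /* 1: x86 -> UCode               */
   STAGE_IMPROVE    = 1 << 1,   /* 2: UCode optimisation         */
   STAGE_INSTRUMENT = 1 << 2,   /* 3: add checking code          */
   STAGE_CLEANUP    = 1 << 3,   /* 4: post-instrumentation tidy  */
   STAGE_REGALLOC   = 1 << 4,   /* 5: register allocation        */
   STAGE_EMIT       = 1 << 5    /* 6: emit x86 code              */
};

/* Arguments mirror --optimise, --instrument and --cleanup. */
static unsigned stages_to_run(bool optimise, bool instrument, bool cleanup)
{
   unsigned s = STAGE_PARSE | STAGE_REGALLOC | STAGE_EMIT;
   if (optimise)
      s |= STAGE_IMPROVE;
   if (instrument) {
      s |= STAGE_INSTRUMENT;
      if (cleanup)            /* stage 4 only makes sense after stage 3 */
         s |= STAGE_CLEANUP;
   }
   return s;
}]]></programlisting>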
923
924<para>The <computeroutput>--stop-after=N</computeroutput> flag
925switches back to the real CPU after
926<computeroutput>N</computeroutput> basic blocks. It also re-JITs
927the final basic block executed and prints the debugging info
928resulting, so this gives you a way to get a quick snapshot of how
929a basic block looks as it passes through the six stages mentioned
930above. If you want to see full information for every block
931translated (probably not, but still ...) find, in
932<computeroutput>VG_(translate)</computeroutput>, the lines</para>
933<programlisting><![CDATA[
934dis = True;
935dis = debugging_translation;]]></programlisting>
936
937<para>and comment out the second line. This will spew out
938debugging junk faster than you can possibly imagine.</para>
939
940</sect2>
941
942
943
944<sect2 id="mc-tech-docs.tags" xreflabel="UCode operand tags: type 'Tag'">
945<title>UCode operand tags: type <computeroutput>Tag</computeroutput></title>
946
947<para>UCode is, more or less, a simple two-address RISC-like
948code. In keeping with the x86 AT&amp;T assembly syntax,
949generally speaking the first operand is the source operand, and
950the second is the destination operand, which is modified when the
951uinstr is notionally executed.</para>
952
953<para>UCode instructions have up to three operand fields, each of
954which has a corresponding <computeroutput>Tag</computeroutput>
955describing it. Possible values for the tag are:</para>
956
957<itemizedlist>
958
959 <listitem>
960 <para><computeroutput>NoValue</computeroutput>: indicates
961 that the field is not in use.</para>
962 </listitem>
963
964 <listitem>
965 <para><computeroutput>Lit16</computeroutput>: the field
966 contains a 16-bit literal.</para>
967 </listitem>
968
969 <listitem>
970 <para><computeroutput>Literal</computeroutput>: the field
971 denotes a 32-bit literal, whose value is stored in the
972 <computeroutput>lit32</computeroutput> field of the uinstr
973 itself. Since there is only one
974 <computeroutput>lit32</computeroutput> for the whole uinstr,
975 only one operand field may contain this tag.</para>
976 </listitem>
977
978 <listitem>
979 <para><computeroutput>SpillNo</computeroutput>: the field
980 contains a spill slot number, in the range 0 to 23 inclusive,
981 denoting one of the spill slots contained inside
982 <computeroutput>VG_(baseBlock)</computeroutput>. Such tags
983 only exist after register allocation.</para>
984 </listitem>
985
986 <listitem>
987 <para><computeroutput>RealReg</computeroutput>: the field
988 contains a number in the range 0 to 7 denoting an integer x86
989 ("real") register on the host. The number is the Intel
990 encoding for integer registers. Such tags only exist after
991 register allocation.</para>
992 </listitem>
993
994 <listitem>
995 <para><computeroutput>ArchReg</computeroutput>: the field
996 contains a number in the range 0 to 7 denoting an integer x86
997 register on the simulated CPU. In reality this means a
998 reference to one of the first 8 words of
999 <computeroutput>VG_(baseBlock)</computeroutput>. Such tags
1000 can exist at any point in the translation process.</para>
1001 </listitem>
1002
1003 <listitem>
1004 <para>Last, but not least,
1005 <computeroutput>TempReg</computeroutput>. The field contains
1006 the number of one of an infinite set of virtual (integer)
1007 registers. <computeroutput>TempReg</computeroutput>s are used
1008 everywhere throughout the translation process; you can have
1009 as many as you want. The register allocator maps as many as
1010 it can into <computeroutput>RealReg</computeroutput>s and
1011 turns the rest into
1012 <computeroutput>SpillNo</computeroutput>s, so
1013 <computeroutput>TempReg</computeroutput>s should not exist
1014 after the register allocation phase.</para>
1015
1016 <para><computeroutput>TempReg</computeroutput>s are always 32
1017 bits long, even if the data they hold is logically shorter.
1018 In that case the upper unused bits are required, and, I
1019 think, generally assumed, to be zero.
1020 <computeroutput>TempReg</computeroutput>s holding V bits for
1021 quantities shorter than 32 bits are expected to have ones in
1022 the unused places, since a one denotes "undefined".</para>
1023 </listitem>
1024
1025</itemizedlist>
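
<para>The ranges and phase restrictions in the list above can be
collected into a small checker.  This is only a sketch: the tag
names are the ones listed, but the enum values, the function
<computeroutput>operand_ok</computeroutput> and its
<computeroutput>after_regalloc</computeroutput> parameter are
hypothetical.</para>
<programlisting><![CDATA[
#include <stdbool.h>

/* Tag names from the list above; the numeric values are arbitrary. */
typedef enum {
   NoValue, Lit16, Literal, SpillNo, RealReg, ArchReg, TempReg
} Tag;

/* Check one operand field against the ranges given above.
   `after_regalloc` says which side of register allocation we are on. */
static bool operand_ok(Tag tag, unsigned val, bool after_regalloc)
{
   switch (tag) {
      case NoValue:
      case Literal: return true;                 /* val is not used   */
      case Lit16:   return val <= 0xFFFF;        /* 16-bit literal    */
      case ArchReg: return val <= 7;             /* any phase         */
      case RealReg: return after_regalloc && val <= 7;
      case SpillNo: return after_regalloc && val <= 23;
      case TempReg: return !after_regalloc;      /* any number at all */
   }
   return false;
}]]></programlisting>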
1026
1027</sect2>
1028
1029
1030
1031<sect2 id="mc-tech-docs.uinstr"
1032 xreflabel="UCode instructions: type 'UInstr'">
1033<title>UCode instructions: type <computeroutput>UInstr</computeroutput></title>
1034
1035<para>UCode was carefully designed to make it possible to do
1036register allocation on UCode and then translate the result into
1037x86 code without needing any extra registers ... well, that was
1038the original plan, anyway. Things have gotten a little more
1039complicated since then. In what follows, UCode instructions are
1040referred to as uinstrs, to distinguish them from x86
1041instructions. Uinstrs of course have uopcodes which are
1042(naturally) different from x86 opcodes.</para>
1043
1044<para>A uinstr (type <computeroutput>UInstr</computeroutput>)
1045contains various fields, not all of which are used by any one
1046uopcode:</para>
1047
1048<itemizedlist>
1049
1050 <listitem>
1051 <para>Three 16-bit operand fields,
1052 <computeroutput>val1</computeroutput>,
1053 <computeroutput>val2</computeroutput> and
1054 <computeroutput>val3</computeroutput>.</para>
1055 </listitem>
1056
1057 <listitem>
1058 <para>Three tag fields,
1059 <computeroutput>tag1</computeroutput>,
1060 <computeroutput>tag2</computeroutput> and
1061 <computeroutput>tag3</computeroutput>. Each of these has a
1062 value of type <computeroutput>Tag</computeroutput>, and they
1063 describe what the <computeroutput>val1</computeroutput>,
1064 <computeroutput>val2</computeroutput> and
1065 <computeroutput>val3</computeroutput> fields contain.</para>
1066 </listitem>
1067
1068 <listitem>
1069 <para>A 32-bit literal field.</para>
1070 </listitem>
1071
1072 <listitem>
1073 <para>Two <computeroutput>FlagSet</computeroutput>s,
1074 specifying which x86 condition codes are read and written by
1075 the uinstr.</para>
1076 </listitem>
1077
1078 <listitem>
1079 <para>An opcode byte, containing a value of type
1080 <computeroutput>Opcode</computeroutput>.</para>
1081 </listitem>
1082
1083 <listitem>
1084 <para>A size field, indicating the data transfer size
1085 (1/2/4/8/10) in cases where this makes sense, or zero
1086 otherwise.</para>
1087 </listitem>
1088
1089 <listitem>
1090 <para>A condition-code field, which, for jumps, holds a value
1091 of type <computeroutput>Condcode</computeroutput>, indicating
1092 the condition which applies. The encoding is as it is in the
1093 x86 insn stream, except we add a 17th value
1094 <computeroutput>CondAlways</computeroutput> to indicate an
1095 unconditional transfer.</para>
1096 </listitem>
1097
1098 <listitem>
1099 <para>Various 1-bit flags, indicating whether this insn
1100 pertains to an x86 CALL or RET instruction, whether a
1101 widening is signed or not, etc.</para>
1102 </listitem>
1103
1104</itemizedlist>
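
<para>As a rough picture, the fields above might be laid out like
this.  The names <computeroutput>val1</computeroutput> to
<computeroutput>val3</computeroutput>,
<computeroutput>tag1</computeroutput> to
<computeroutput>tag3</computeroutput> and
<computeroutput>lit32</computeroutput> come from the text; every
other name, the field widths other than the 16-bit operands and
32-bit literal, and the value 16 for the "always" condition are
guesses, so this is not the real
<computeroutput>UInstr</computeroutput> declaration.</para>
<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>

typedef struct {
   uint32_t lit32;             /* 32-bit literal                    */
   uint16_t val1, val2, val3;  /* operand fields                    */
   uint8_t  tag1, tag2, tag3;  /* a Tag for each operand field      */
   uint8_t  opcode;            /* value of type Opcode              */
   uint8_t  size;              /* 0, 1, 2, 4, 8 or 10               */
   uint8_t  flags_r, flags_w;  /* FlagSets: codes read and written  */
   uint8_t  cond;              /* Condcode, for jumps               */
   unsigned signed_widen : 1;  /* 1-bit flags                       */
   unsigned is_call      : 1;
   unsigned is_ret       : 1;
} UInstrSketch;

#define COND_ALWAYS 16         /* a 17th value after x86's 16 (assumed) */

/* True if the size field holds one of the values the text allows. */
static int size_is_valid(const UInstrSketch *u)
{
   assert(u != 0);
   return u->size == 0 || u->size == 1 || u->size == 2 ||
          u->size == 4 || u->size == 8 || u->size == 10;
}]]></programlisting>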
1105
1106<para>UOpcodes (type <computeroutput>Opcode</computeroutput>) are
1107divided into two groups: those necessary merely to express the
1108functionality of the x86 code, and extra uopcodes needed to
1109express the instrumentation. The former group contains:</para>
1110
1111<itemizedlist>
1112
1113 <listitem>
1114 <para><computeroutput>GET</computeroutput> and
1115 <computeroutput>PUT</computeroutput>, which move values from
1116 the simulated CPU's integer registers
1117 (<computeroutput>ArchReg</computeroutput>s) into
1118 <computeroutput>TempReg</computeroutput>s, and back.
1119 <computeroutput>GETF</computeroutput> and
1120 <computeroutput>PUTF</computeroutput> do the corresponding
1121 thing for the simulated
1122 <computeroutput>%EFLAGS</computeroutput>. There are no
1123 corresponding insns for the FPU register stack, since we
1124 don't explicitly simulate its registers.</para>
1125 </listitem>
1126
1127 <listitem>
1128 <para><computeroutput>LOAD</computeroutput> and
1129 <computeroutput>STORE</computeroutput>, which, in RISC-like
1130 fashion, are the only uinstrs able to interact with
1131 memory.</para>
1132 </listitem>
1133
1134 <listitem>
1135 <para><computeroutput>MOV</computeroutput> and
1136 <computeroutput>CMOV</computeroutput> allow unconditional and
1137 conditional moves of values between
1138 <computeroutput>TempReg</computeroutput>s.</para>
1139 </listitem>
1140
1141 <listitem>
1142 <para>ALU operations. Again in RISC-like fashion, these only
1143 operate on <computeroutput>TempReg</computeroutput>s (before
1144 reg-alloc) or <computeroutput>RealReg</computeroutput>s
1145 (after reg-alloc). These are:
1146 <computeroutput>ADD</computeroutput>,
1147 <computeroutput>ADC</computeroutput>,
1148 <computeroutput>AND</computeroutput>,
1149 <computeroutput>OR</computeroutput>,
1150 <computeroutput>XOR</computeroutput>,
1151 <computeroutput>SUB</computeroutput>,
1152 <computeroutput>SBB</computeroutput>,
1153 <computeroutput>SHL</computeroutput>,
1154 <computeroutput>SHR</computeroutput>,
1155 <computeroutput>SAR</computeroutput>,
1156 <computeroutput>ROL</computeroutput>,
1157 <computeroutput>ROR</computeroutput>,
1158 <computeroutput>RCL</computeroutput>,
1159 <computeroutput>RCR</computeroutput>,
1160 <computeroutput>NOT</computeroutput>,
1161 <computeroutput>NEG</computeroutput>,
1162 <computeroutput>INC</computeroutput>,
1163 <computeroutput>DEC</computeroutput>,
1164 <computeroutput>BSWAP</computeroutput>,
1165 <computeroutput>CC2VAL</computeroutput> and
1166 <computeroutput>WIDEN</computeroutput>.
1167 <computeroutput>WIDEN</computeroutput> does signed or
1168 unsigned value widening.
1169 <computeroutput>CC2VAL</computeroutput> is used to convert
1170 condition codes into a value, zero or one. The rest are
1171 obvious.</para>
1172
1173 <para>To allow for more efficient code generation, we bend
1174 slightly the restriction at the start of the previous para:
1175 for <computeroutput>ADD</computeroutput>,
1176 <computeroutput>ADC</computeroutput>,
1177 <computeroutput>XOR</computeroutput>,
1178 <computeroutput>SUB</computeroutput> and
1179 <computeroutput>SBB</computeroutput>, we allow the first
1180 (source) operand to also be an
1181 <computeroutput>ArchReg</computeroutput>, that is, one of the
1182 simulated machine's registers. Also, many of these ALU ops
1183 allow the source operand to be a literal. See
1184 <computeroutput>VG_(saneUInstr)</computeroutput> for the
1185 final word on the allowable forms of uinstrs.</para>
1186 </listitem>
1187
1188 <listitem>
    <para><computeroutput>LEA1</computeroutput> and
    <computeroutput>LEA2</computeroutput> are not strictly
    necessary, but facilitate better translations.  They
    record the fancy x86 addressing modes in a direct way, which
    allows those amodes to be emitted back into the final
    instruction stream more or less verbatim.</para>
1195 </listitem>
1196
1197 <listitem>
1198 <para><computeroutput>CALLM</computeroutput> calls a
1199 machine-code helper, one of the methods whose address is
1200 stored at some
1201 <computeroutput>VG_(baseBlock)</computeroutput> offset.
    <computeroutput>PUSH</computeroutput> and
    <computeroutput>POP</computeroutput> move values between a
    <computeroutput>TempReg</computeroutput> and the real
    (Valgrind's) stack, and
1206 <computeroutput>CLEAR</computeroutput> removes values from
1207 the stack. <computeroutput>CALLM_S</computeroutput> and
1208 <computeroutput>CALLM_E</computeroutput> delimit the
1209 boundaries of call setups and clearings, for the benefit of
1210 the instrumentation passes. Getting this right is critical,
1211 and so <computeroutput>VG_(saneUCodeBlock)</computeroutput>
1212 makes various checks on the use of these uopcodes.</para>
1213
1214 <para>It is important to understand that these uopcodes have
1215 nothing to do with the x86
1216 <computeroutput>call</computeroutput>,
    <computeroutput>ret</computeroutput>,
1218 <computeroutput>push</computeroutput> or
1219 <computeroutput>pop</computeroutput> instructions, and are
1220 not used to implement them. Those guys turn into
1221 combinations of <computeroutput>GET</computeroutput>,
1222 <computeroutput>PUT</computeroutput>,
1223 <computeroutput>LOAD</computeroutput>,
1224 <computeroutput>STORE</computeroutput>,
1225 <computeroutput>ADD</computeroutput>,
1226 <computeroutput>SUB</computeroutput>, and
1227 <computeroutput>JMP</computeroutput>. What these uopcodes
1228 support is calling of helper functions such as
1229 <computeroutput>VG_(helper_imul_32_64)</computeroutput>,
1230 which do stuff which is too difficult or tedious to emit
1231 inline.</para>
1232 </listitem>
1233
1234 <listitem>
1235 <para><computeroutput>FPU</computeroutput>,
1236 <computeroutput>FPU_R</computeroutput> and
1237 <computeroutput>FPU_W</computeroutput>. Valgrind doesn't
1238 attempt to simulate the internal state of the FPU at all.
1239 Consequently it only needs to be able to distinguish FPU ops
1240 which read and write memory from those that don't, and for
1241 those which do, it needs to know the effective address and
1242 data transfer size. This is made easier because the x86 FP
1243 instruction encoding is very regular, basically consisting of
1244 16 bits for a non-memory FPU insn and 11 (IIRC) bits + an
1245 address mode for a memory FPU insn. So our
1246 <computeroutput>FPU</computeroutput> uinstr carries the 16
1247 bits in its <computeroutput>val1</computeroutput> field. And
1248 <computeroutput>FPU_R</computeroutput> and
1249 <computeroutput>FPU_W</computeroutput> carry 11 bits in that
1250 field, together with the identity of a
1251 <computeroutput>TempReg</computeroutput> or (later)
1252 <computeroutput>RealReg</computeroutput> which contains the
1253 address.</para>
1254 </listitem>
1255
1256 <listitem>
1257 <para><computeroutput>JIFZ</computeroutput> is unique, in
1258 that it allows a control-flow transfer which is not deemed to
1259 end a basic block. It causes a jump to a literal (original)
1260 address if the specified argument is zero.</para>
1261 </listitem>
1262
1263 <listitem>
1264 <para>Finally, <computeroutput>INCEIP</computeroutput>
1265 advances the simulated <computeroutput>%EIP</computeroutput>
1266 by the specified literal amount. This supports lazy
1267 <computeroutput>%EIP</computeroutput> updating, as described
1268 below.</para>
1269 </listitem>
1270
1271</itemizedlist>
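
<para>To see how an x86 stack instruction can be built from these
simple pieces, here is a toy C model of
<computeroutput>pushl %eax</computeroutput> using only
<computeroutput>GET</computeroutput>,
<computeroutput>SUB</computeroutput>,
<computeroutput>PUT</computeroutput> and
<computeroutput>STORE</computeroutput>.  The ordering shown is one
plausible sequence and is an assumption, not a dump of what
Valgrind emits; the arrays and function names are invented for
the example.  Register numbers follow the Intel encoding.</para>
<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { R_EAX = 0, R_ECX, R_EDX, R_EBX, R_ESP, R_EBP, R_ESI, R_EDI };

static uint32_t arch_reg[8];   /* simulated integer registers (ArchRegs) */
static uint8_t  memory[64];    /* toy client memory                      */

static uint32_t GET(int archreg)             { return arch_reg[archreg]; }
static void     PUT(uint32_t t, int archreg) { arch_reg[archreg] = t; }

static void STORE4(uint32_t t, uint32_t addr)
{
   assert(addr <= sizeof memory - 4);
   memcpy(&memory[addr], &t, 4);
}

static uint32_t LOAD4(uint32_t addr)
{
   uint32_t t;
   assert(addr <= sizeof memory - 4);
   memcpy(&t, &memory[addr], 4);
   return t;
}

/* "pushl %eax": t0 and t1 play the role of TempRegs. */
static void sim_pushl_eax(void)
{
   uint32_t t0 = GET(R_ESP);   /* GET   %ESP -> t0  */
   t0 -= 4;                    /* SUB   $4, t0      */
   PUT(t0, R_ESP);             /* PUT   t0 -> %ESP  */
   uint32_t t1 = GET(R_EAX);   /* GET   %EAX -> t1  */
   STORE4(t1, t0);             /* STORE t1 -> (t0)  */
}]]></programlisting>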
1272
1273<para>Stages 1 and 2 of the 6-stage translation process mentioned
1274above deal purely with these uopcodes, and no others. They are
1275sufficient to express pretty much all the x86 32-bit
1276protected-mode instruction set, at least everything understood by
1277a pre-MMX original Pentium (P54C).</para>
1278
<para>Stages 3, 4, 5 and 6 also deal with the following extra
"instrumentation" uopcodes.  They are used to express all the
definedness-tracking and -checking machinery which Valgrind does.
In later sections we show how to create checking code for each of
the uopcodes above.  Note that these instrumentation uopcodes,
although some appear complicated, have been carefully chosen
so that efficient x86 code can be generated for them.  GNU
superopt v2.5 did a great job helping out here.  Anyways, the
uopcodes are as follows:</para>
1288
1289<itemizedlist>
1290
1291 <listitem>
    <para><computeroutput>GETV</computeroutput> and
    <computeroutput>PUTV</computeroutput> are analogues of
    <computeroutput>GET</computeroutput> and
    <computeroutput>PUT</computeroutput> above.  They are
    identical except that they move the V bits for the specified
    values back and forth to
    <computeroutput>TempReg</computeroutput>s, rather than moving
    the values themselves.</para>
1300 </listitem>
1301
1302 <listitem>
1303 <para>Similarly, <computeroutput>LOADV</computeroutput> and
1304 <computeroutput>STOREV</computeroutput> read and write V bits
1305 from the synthesised shadow memory that Valgrind maintains.
1306 In fact they do more than that, since they also do
    address-validity checks, and emit complaints if the
    read/written addresses are unaddressable.</para>
1309 </listitem>
1310
1311 <listitem>
1312 <para><computeroutput>TESTV</computeroutput>, whose
1313 parameters are a <computeroutput>TempReg</computeroutput> and
1314 a size, tests the V bits in the
1315 <computeroutput>TempReg</computeroutput>, at the specified
1316 operation size (0/1/2/4 byte) and emits an error if any of
1317 them indicate undefinedness. This is the only uopcode
1318 capable of doing such tests.</para>
1319 </listitem>
1320
1321 <listitem>
    <para><computeroutput>SETV</computeroutput>, whose parameters
    are also a <computeroutput>TempReg</computeroutput> and a size,
    makes the V bits in the
    <computeroutput>TempReg</computeroutput> indicate
    definedness, at the specified operation size.  This is
1327 usually used to generate the correct V bits for a literal
1328 value, which is of course fully defined.</para>
1329 </listitem>
1330
1331 <listitem>
    <para><computeroutput>GETVF</computeroutput> and
    <computeroutput>PUTVF</computeroutput> are analogues of
1334 <computeroutput>GETF</computeroutput> and
1335 <computeroutput>PUTF</computeroutput>. They move the single
1336 V bit used to model definedness of
1337 <computeroutput>%EFLAGS</computeroutput> between its home in
1338 <computeroutput>VG_(baseBlock)</computeroutput> and the
1339 specified <computeroutput>TempReg</computeroutput>.</para>
1340 </listitem>
1341
1342 <listitem>
1343 <para><computeroutput>TAG1</computeroutput> denotes one of a
1344 family of unary operations on
1345 <computeroutput>TempReg</computeroutput>s containing V bits.
1346 Similarly, <computeroutput>TAG2</computeroutput> denotes one
1347 in a family of binary operations on V bits.</para>
1348 </listitem>
1349
1350</itemizedlist>
1351
1352
1353<para>These 10 uopcodes are sufficient to express Valgrind's
1354entire definedness-checking semantics. In fact most of the
1355interesting magic is done by the
1356<computeroutput>TAG1</computeroutput> and
1357<computeroutput>TAG2</computeroutput> suboperations.</para>
1358
<para>First, however, I need to explain about V-vector operation
sizes.  There are 4 sizes.  Sizes 1, 2 and 4 operate on groups of
8, 16 and 32 V bits at a time, supporting the usual 1, 2 and 4
byte x86 operations.  The fourth is the mysterious size
0, which really means a single V bit.  Single V bits are used in
various circumstances; in particular, the definedness of
<computeroutput>%EFLAGS</computeroutput> is modelled with a
single V bit.  Now might be a good time to also point out that
for V bits, 1 means "undefined" and 0 means "defined".
Similarly, for A bits, 1 means "invalid address" and 0 means
"valid address".  This seems counterintuitive (and so it is), but
testing against zero on x86s saves instructions compared to
testing against all 1s, because many ALU operations set the Z
flag for free, so to speak.</para>
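
<para>The size convention and the "1 means undefined" encoding
can be made concrete with a short C sketch.  It is an
illustration under stated assumptions rather than Valgrind's
code: the function names are invented, the single V bit for size
0 is assumed to live in bit 0, and the unused upper bits of a
32-bit V vector are set to ones, following the rule given earlier
for <computeroutput>TempReg</computeroutput>s holding V bits for
short quantities.</para>
<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>

/* Mask selecting the V bits that matter at a given operation size.
   Size 0 is the single-V-bit case. */
static uint32_t vmask(int size)
{
   switch (size) {
      case 0:  return 0x00000001u;
      case 1:  return 0x000000FFu;
      case 2:  return 0x0000FFFFu;
      case 4:  return 0xFFFFFFFFu;
      default: assert(0 && "bad V-bit operation size"); return 0;
   }
}

/* TESTV-like check: nonzero if any relevant V bit says "undefined".
   Note that this is just a test against zero. */
static int any_undefined(uint32_t vbits, int size)
{
   return (vbits & vmask(size)) != 0;
}

/* SETV-like value: relevant bits defined (0), unused bits ones. */
static uint32_t vbits_defined(int size)
{
   return ~vmask(size);
}]]></programlisting>

<para>Because "defined" is all zeros in the relevant bits, the
check in <computeroutput>any_undefined</computeroutput> reduces
to a mask and a comparison with zero, which is the saving the
paragraph above refers to.</para>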
1373
1374<para>With that in mind, the tag ops are:</para>
1375
1376<itemizedlist>
1377
1378 <listitem>
1379 <formalpara>
1380 <title>(UNARY) Pessimising casts:</title>
1381 <para><computeroutput>VgT_PCast40</computeroutput>,
1382 <computeroutput>VgT_PCast20</computeroutput>,
1383 <computeroutput>VgT_PCast10</computeroutput>,
1384 <computeroutput>VgT_PCast01</computeroutput>,
1385 <computeroutput>VgT_PCast02</computeroutput> and
1386 <computeroutput>VgT_PCast04</computeroutput>. A "pessimising
1387 cast" takes a V-bit vector at one size, and creates a new one
1388 at another size, pessimised in the sense that if any of the
1389 bits in the source vector indicate undefinedness, then all
1390 the bits in the result indicate undefinedness. In this case
1391 the casts are all to or from a single V bit, so for example
1392 <computeroutput>VgT_PCast40</computeroutput> is a pessimising
1393 cast from 32 bits to 1, whereas
1394 <computeroutput>VgT_PCast04</computeroutput> simply copies
1395 the single source V bit into all 32 bit positions in the
1396 result. Surprisingly, these ops can all be implemented very
1397 efficiently.</para>
1398 </formalpara>
1399
1400 <para>There are also the pessimising casts
1401 <computeroutput>VgT_PCast14</computeroutput>, from 8 bits to
1402 32, <computeroutput>VgT_PCast12</computeroutput>, from 8 bits
1403 to 16, and <computeroutput>VgT_PCast11</computeroutput>, from
1404 8 bits to 8. This last one seems nonsensical, but in fact it
1405 isn't a no-op because, as mentioned above, any undefined (1)
1406 bits in the source infect the entire result.</para>
1407 </listitem>
1408
1409 <listitem>
1410 <formalpara>
1411 <title>(UNARY) Propagating undefinedness upwards in a
1412 word:</title>
1413 <para><computeroutput>VgT_Left4</computeroutput>,
1414 <computeroutput>VgT_Left2</computeroutput> and
1415 <computeroutput>VgT_Left1</computeroutput>. These are used
1416 to simulate the worst-case effects of carry propagation in
1417 adds and subtracts. They return a V vector identical to the
    original, except that if the original contained any undefined
    bits, then the lowest undefined bit and all bits above it are
    marked as undefined.  Hence the Left bit in the names.</para></formalpara>
1421 </listitem>
1422
1423 <listitem>
1424 <formalpara>
1425 <title>(UNARY) Signed and unsigned value widening:</title>
1426 <para><computeroutput>VgT_SWiden14</computeroutput>,
1427 <computeroutput>VgT_SWiden24</computeroutput>,
1428 <computeroutput>VgT_SWiden12</computeroutput>,
1429 <computeroutput>VgT_ZWiden14</computeroutput>,
1430 <computeroutput>VgT_ZWiden24</computeroutput> and
1431 <computeroutput>VgT_ZWiden12</computeroutput>. These mimic
1432 the definedness effects of standard signed and unsigned
1433 integer widening. Unsigned widening creates zero bits in the
1434 new positions, so
    <computeroutput>VgT_ZWiden*</computeroutput> accordingly
    mark those parts of their argument as defined.  Signed
1437 widening copies the sign bit into the new positions, so
1438 <computeroutput>VgT_SWiden*</computeroutput> copies the
1439 definedness of the sign bit into the new positions. Because
1440 1 means undefined and 0 means defined, these operations can
1441 (fascinatingly) be done by the same operations which they
1442 mimic. Go figure.</para>
1443 </formalpara>
1444 </listitem>
1445
1446 <listitem>
1447 <formalpara>
1448 <title>(BINARY) Undefined-if-either-Undefined,
1449 Defined-if-either-Defined:</title>
1450 <para><computeroutput>VgT_UifU4</computeroutput>,
1451 <computeroutput>VgT_UifU2</computeroutput>,
1452 <computeroutput>VgT_UifU1</computeroutput>,
1453 <computeroutput>VgT_UifU0</computeroutput>,
1454 <computeroutput>VgT_DifD4</computeroutput>,
1455 <computeroutput>VgT_DifD2</computeroutput>,
1456 <computeroutput>VgT_DifD1</computeroutput>. These do simple
1457 bitwise operations on pairs of V-bit vectors, with
1458 <computeroutput>UifU</computeroutput> giving undefined if
1459 either arg bit is undefined, and
1460 <computeroutput>DifD</computeroutput> giving defined if
1461 either arg bit is defined. Abstract interpretation junkies,
1462 if any make it this far, may like to think of them as meets
1463 and joins (or is it joins and meets) in the definedness
1464 lattices.</para>
1465 </formalpara>
1466 </listitem>
1467
1468 <listitem>
1469 <formalpara>
1470 <title>(BINARY; one value, one V bits) Generate argument
1471 improvement terms for AND and OR</title>
1472 <para><computeroutput>VgT_ImproveAND4_TQ</computeroutput>,
1473 <computeroutput>VgT_ImproveAND2_TQ</computeroutput>,
1474 <computeroutput>VgT_ImproveAND1_TQ</computeroutput>,
1475 <computeroutput>VgT_ImproveOR4_TQ</computeroutput>,
1476 <computeroutput>VgT_ImproveOR2_TQ</computeroutput>,
1477 <computeroutput>VgT_ImproveOR1_TQ</computeroutput>. These
1478 help out with AND and OR operations. AND and OR have the
1479 inconvenient property that the definedness of the result
1480 depends on the actual values of the arguments as well as
1481 their definedness. At the bit level:</para></formalpara>
<programlisting><![CDATA[
1 AND undefined = undefined, but
0 AND undefined = 0, and
similarly
0 OR undefined = undefined, but
1 OR undefined = 1.]]></programlisting>
1488
1489 <para>It turns out that gcc (quite legitimately) generates
1490 code which relies on this fact, so we have to model it
1491 properly in order to avoid flooding users with spurious value
1492 errors. The ultimate definedness result of AND and OR is
1493 calculated using <computeroutput>UifU</computeroutput> on the
1494 definedness of the arguments, but we also
1495 <computeroutput>DifD</computeroutput> in some "improvement"
1496 terms which take into account the above phenomena.</para>
1497
    <para><computeroutput>ImproveAND</computeroutput> takes as
    its first argument the actual value of an argument to AND
    (the T) and as its second the definedness of that argument
    (the Q), and returns a V-bit vector which is defined (0) for
    bits which have value 0 and are defined; this, when
    <computeroutput>DifD</computeroutput>ed into the final
    result, causes those bits to be defined even if the
    corresponding bit in the other argument is undefined.</para>
1506
1507 <para>The <computeroutput>ImproveOR</computeroutput> ops do
1508 the dual thing for OR arguments. Note that XOR does not have
1509 this property that one argument can make the other
1510 irrelevant, so there is no need for such complexity for
1511 XOR.</para>
1512 </listitem>
1513
1514</itemizedlist>
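
<para>For readers who prefer code to prose, here is a minimal C
sketch of the unary ops and of
<computeroutput>UifU</computeroutput>/<computeroutput>DifD</computeroutput>,
acting on 32-bit V-bit vectors (1 means undefined, 0 means
defined).  The function names are invented for this illustration
and are not Valgrind's; Valgrind generates x86 code for tag ops
rather than calling C functions, and the
<computeroutput>v | -v</computeroutput> trick used for
<computeroutput>Left</computeroutput> is simply one way of getting
the effect described, assumed here for brevity.</para>

<programlisting><![CDATA[
#include <stdint.h>

/* V-bit convention throughout: 1 = undefined, 0 = defined. */

static uint32_t low_mask(unsigned bits)
{
    return bits >= 32 ? 0xFFFFFFFFu : ((1u << bits) - 1u);
}

/* PCast: if any of the low 'src_bits' V bits is undefined, all of the
   low 'dst_bits' result bits are undefined; otherwise all are defined. */
uint32_t pcast(uint32_t v, unsigned src_bits, unsigned dst_bits)
{
    return (v & low_mask(src_bits)) != 0 ? low_mask(dst_bits) : 0u;
}

/* Left: the lowest undefined bit and everything above it become
   undefined (worst-case carry propagation). */
uint32_t left4(uint32_t v)
{
    return v | (0u - v);
}

/* SWiden14 / ZWiden14: the same extension that is applied to values. */
uint32_t swiden14(uint32_t v)
{
    return (v & 0x80u) ? (v | 0xFFFFFF00u) : (v & 0xFFu);
}

uint32_t zwiden14(uint32_t v)
{
    return v & 0xFFu;
}

/* UifU: undefined if either is undefined.
   DifD: defined if either is defined. */
uint32_t uifu4(uint32_t a, uint32_t b) { return a | b; }
uint32_t difd4(uint32_t a, uint32_t b) { return a & b; }
]]></programlisting>

<para>So, for instance,
<computeroutput>left4(0x00000010)</computeroutput> gives
<computeroutput>0xFFFFFFF0</computeroutput>: an add or subtract
can carry out of bit 4 into anything above it, but never
downwards.</para>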
1515
1516<para>That's all the tag ops. If you stare at this long enough,
1517and then run Valgrind and stare at the pre- and post-instrumented
1518ucode, it should be fairly obvious how the instrumentation
1519machinery hangs together.</para>
1520
1521<para>One point, if you do this: in order to make it easy to
1522differentiate <computeroutput>TempReg</computeroutput>s carrying
1523values from <computeroutput>TempReg</computeroutput>s carrying V
1524bit vectors, Valgrind prints the former as (for example)
1525<computeroutput>t28</computeroutput> and the latter as
1526<computeroutput>q28</computeroutput>; the fact that they carry
1527the same number serves to indicate their relationship. This is
1528purely for the convenience of the human reader; the register
1529allocator and code generator don't regard them as
1530different.</para>
1531
1532</sect2>
1533
1534
1535
1536<sect2 id="mc-manual.trans" xreflabel="Translation into UCode">
1537<title>Translation into UCode</title>
1538
1539<para><computeroutput>VG_(disBB)</computeroutput> allocates a new
1540<computeroutput>UCodeBlock</computeroutput> and then uses
1541<computeroutput>disInstr</computeroutput> to translate x86
1542instructions one at a time into UCode, dumping the result in the
1543<computeroutput>UCodeBlock</computeroutput>. This goes on until
1544a control-flow transfer instruction is encountered.</para>
1545
1546<para>Despite the large size of
1547<filename>vg_to_ucode.c</filename>, this translation is really
1548very simple. Each x86 instruction is translated entirely
1549independently of its neighbours, merrily allocating new
1550<computeroutput>TempReg</computeroutput>s as it goes. The idea
1551is to have a simple translator -- in reality, no more than a
macro-expander -- and the resulting bad UCode translation is
1553cleaned up by the UCode optimisation phase which follows. To
1554give you an idea of some x86 instructions and their translations
1555(this is a complete basic block, as Valgrind sees it):</para>
<programlisting><![CDATA[
0x40435A50:  incl %edx
     0: GETL      %EDX, t0
     1: INCL      t0  (-wOSZAP)
     2: PUTL      t0, %EDX

0x40435A51:  movsbl (%edx),%eax
     3: GETL      %EDX, t2
     4: LDB       (t2), t2
     5: WIDENL_Bs t2
     6: PUTL      t2, %EAX

0x40435A54:  testb $0x20, 1(%ecx,%eax,2)
     7: GETL      %EAX, t6
     8: GETL      %ECX, t8
     9: LEA2L     1(t8,t6,2), t4
    10: LDB       (t4), t10
    11: MOVB      $0x20, t12
    12: ANDB      t12, t10  (-wOSZACP)
    13: INCEIPo   $9

0x40435A59:  jnz-8 0x40435A50
    14: Jnzo      $0x40435A50  (-rOSZACP)
    15: JMPo      $0x40435A5B]]></programlisting>
1580
1581<para>Notice how the block always ends with an unconditional jump
1582to the next block. This is a bit unnecessary, but makes many
1583things simpler.</para>
1584
1585<para>Most x86 instructions turn into sequences of
1586<computeroutput>GET</computeroutput>,
1587<computeroutput>PUT</computeroutput>,
1588<computeroutput>LEA1</computeroutput>,
1589<computeroutput>LEA2</computeroutput>,
1590<computeroutput>LOAD</computeroutput> and
1591<computeroutput>STORE</computeroutput>. Some complicated ones
1592however rely on calling helper bits of code in
1593<filename>vg_helpers.S</filename>. The ucode instructions
1594<computeroutput>PUSH</computeroutput>,
1595<computeroutput>POP</computeroutput>,
1596<computeroutput>CALL</computeroutput>,
1597<computeroutput>CALLM_S</computeroutput> and
1598<computeroutput>CALLM_E</computeroutput> support this. The
1599calling convention is somewhat ad-hoc and is not the C calling
1600convention. The helper routines must save all integer registers,
1601and the flags, that they use. Args are passed on the stack
1602underneath the return address, as usual, and if result(s) are to
1603be returned, it (they) are either placed in dummy arg slots
1604created by the ucode <computeroutput>PUSH</computeroutput>
1605sequence, or just overwrite the incoming args.</para>
1606
1607<para>In order that the instrumentation mechanism can handle
1608calls to these helpers,
1609<computeroutput>VG_(saneUCodeBlock)</computeroutput> enforces the
1610following restrictions on calls to helpers:</para>
1611
1612<itemizedlist>
1613
1614 <listitem>
1615 <para>Each <computeroutput>CALL</computeroutput> uinstr must
1616 be bracketed by a preceding
1617 <computeroutput>CALLM_S</computeroutput> marker (dummy
1618 uinstr) and a trailing
1619 <computeroutput>CALLM_E</computeroutput> marker. These
1620 markers are used by the instrumentation mechanism later to
1621 establish the boundaries of the
1622 <computeroutput>PUSH</computeroutput>,
1623 <computeroutput>POP</computeroutput> and
1624 <computeroutput>CLEAR</computeroutput> sequences for the
1625 call.</para>
1626 </listitem>
1627
1628 <listitem>
1629 <para><computeroutput>PUSH</computeroutput>,
1630 <computeroutput>POP</computeroutput> and
1631 <computeroutput>CLEAR</computeroutput> may only appear inside
1632 sections bracketed by
1633 <computeroutput>CALLM_S</computeroutput> and
1634 <computeroutput>CALLM_E</computeroutput>, and nowhere else.</para>
1635 </listitem>
1636
1637 <listitem>
1638 <para>In any such bracketed section, no two
1639 <computeroutput>PUSH</computeroutput> insns may push the same
    <computeroutput>TempReg</computeroutput>.  Dually, no two
1641 <computeroutput>POP</computeroutput>s may pop the same
1642 <computeroutput>TempReg</computeroutput>.</para>
1643 </listitem>
1644
1645 <listitem>
1646 <para>Finally, although this is not checked, args should be
1647 removed from the stack with
1648 <computeroutput>CLEAR</computeroutput>, rather than
1649 <computeroutput>POP</computeroutput>s into a
1650 <computeroutput>TempReg</computeroutput> which is not
1651 subsequently used. This is because the instrumentation
1652 mechanism assumes that all values
1653 <computeroutput>POP</computeroutput>ped from the stack are
1654 actually used.</para>
1655 </listitem>
1656
1657</itemizedlist>
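
<para>The checked restrictions can be pictured as a single pass
over the block.  What follows is only a sketch under stated
assumptions: the <computeroutput>ToyUInstr</computeroutput> type,
the opcode names and
<computeroutput>helper_calls_are_sane</computeroutput> are made up
for the example and are not the real
<computeroutput>VG_(saneUCodeBlock)</computeroutput>, and the
rule that bracketed sections do not nest is an assumption of the
sketch.</para>

<programlisting><![CDATA[
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down uinstr representation for this sketch only. */
typedef enum {
    OP_OTHER, OP_CALLM_S, OP_CALLM_E, OP_PUSH, OP_POP, OP_CLEAR, OP_CALL
} Opcode;

typedef struct {
    Opcode op;
    int    tempreg;   /* TempReg number for PUSH/POP; ignored otherwise */
} ToyUInstr;

#define MAX_TEMPREGS 64

/* Returns true if the block obeys the helper-call restrictions. */
bool helper_calls_are_sane(const ToyUInstr *code, size_t n)
{
    bool in_call = false;
    bool pushed[MAX_TEMPREGS] = { false };
    bool popped[MAX_TEMPREGS] = { false };

    for (size_t i = 0; i < n; i++) {
        const ToyUInstr *u = &code[i];
        switch (u->op) {
        case OP_CALLM_S:
            if (in_call) return false;      /* assumed: no nesting */
            in_call = true;
            for (int t = 0; t < MAX_TEMPREGS; t++)
                pushed[t] = popped[t] = false;
            break;
        case OP_CALLM_E:
            if (!in_call) return false;     /* end marker without start */
            in_call = false;
            break;
        case OP_CALL:
        case OP_CLEAR:
            if (!in_call) return false;     /* only inside a bracket */
            break;
        case OP_PUSH:
        case OP_POP: {
            bool *seen = (u->op == OP_PUSH) ? pushed : popped;
            if (!in_call) return false;
            if (u->tempreg < 0 || u->tempreg >= MAX_TEMPREGS) return false;
            if (seen[u->tempreg]) return false;   /* same TempReg twice */
            seen[u->tempreg] = true;
            break;
        }
        default:
            break;
        }
    }
    return !in_call;   /* an unclosed CALLM_S is also rejected */
}
]]></programlisting>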
1658
1659<para>Some of the translations may appear to have redundant
1660<computeroutput>TempReg</computeroutput>-to-<computeroutput>TempReg</computeroutput>
1661moves. This helps the next phase, UCode optimisation, to
1662generate better code.</para>
1663
1664</sect2>
1665
1666
1667
1668<sect2 id="mc-tech-docs.optim" xreflabel="UCode optimisation">
1669<title>UCode optimisation</title>
1670
1671<para>UCode is then subjected to an improvement pass
1672(<computeroutput>vg_improve()</computeroutput>), which blurs the
1673boundaries between the translations of the original x86
1674instructions. It's pretty straightforward. Three
1675transformations are done:</para>
1676
1677<itemizedlist>
1678
1679 <listitem>
1680 <para>Redundant <computeroutput>GET</computeroutput>
1681 elimination. Actually, more general than that -- eliminates
1682 redundant fetches of ArchRegs. In our running example,
1683 uinstr 3 <computeroutput>GET</computeroutput>s
1684 <computeroutput>%EDX</computeroutput> into
1685 <computeroutput>t2</computeroutput> despite the fact that, by
1686 looking at the previous uinstr, it is already in
1687 <computeroutput>t0</computeroutput>. The
1688 <computeroutput>GET</computeroutput> is therefore removed,
1689 and <computeroutput>t2</computeroutput> renamed to
1690 <computeroutput>t0</computeroutput>. Assuming
1691 <computeroutput>t0</computeroutput> is allocated to a host
1692 register, it means the simulated
1693 <computeroutput>%EDX</computeroutput> will exist in a host
1694 CPU register for more than one simulated x86 instruction,
1695 which seems to me to be a highly desirable property.</para>
1696
    <para>There is some mucking around to do with subregisters:
    <computeroutput>%AL</computeroutput> vs
    <computeroutput>%AH</computeroutput> vs
    <computeroutput>%AX</computeroutput> vs
    <computeroutput>%EAX</computeroutput> etc.  I can't remember
    how it works, but in general we are very conservative, and
    these tend to invalidate the caching.</para>
1704 </listitem>
1705
1706 <listitem>
1707 <para>Redundant <computeroutput>PUT</computeroutput>
1708 elimination. This annuls
1709 <computeroutput>PUT</computeroutput>s of values back to
1710 simulated CPU registers if a later
1711 <computeroutput>PUT</computeroutput> would overwrite the
    earlier <computeroutput>PUT</computeroutput> value, and there
    are no intervening reads of the simulated register
1714 (<computeroutput>ArchReg</computeroutput>).</para>
1715
1716 <para>As before, we are paranoid when faced with subregister
1717 references. Also, <computeroutput>PUT</computeroutput>s of
1718 <computeroutput>%ESP</computeroutput> are never annulled,
    because it is vital that the instrumenter always has an up-to-date
    <computeroutput>%ESP</computeroutput> value available, as
    <computeroutput>%ESP</computeroutput> changes affect
    addressability of the memory around the simulated stack
    pointer.</para>
1724
1725 <para>The implication of the above paragraph is that the
1726 simulated machine's registers are only lazily updated once
1727 the above two optimisation phases have run, with the
1728 exception of <computeroutput>%ESP</computeroutput>.
1729 <computeroutput>TempReg</computeroutput>s go dead at the end
    of every basic block, from which it is inferable that any
1731 <computeroutput>TempReg</computeroutput> caching a simulated
1732 CPU reg is flushed (back into the relevant
1733 <computeroutput>VG_(baseBlock)</computeroutput> slot) at the
1734 end of every basic block. The further implication is that
    the simulated registers are only up-to-date in between
1736 basic blocks, and not at arbitrary points inside basic
1737 blocks. And the consequence of that is that we can only
1738 deliver signals to the client in between basic blocks. None
1739 of this seems any problem in practice.</para>
1740 </listitem>
1741
1742 <listitem>
1743 <para>Finally there is a simple def-use thing for condition
1744 codes. If an earlier uinstr writes the condition codes, and
1745 the next uinsn along which actually cares about the condition
1746 codes writes the same or larger set of them, but does not
1747 read any, the earlier uinsn is marked as not writing any
1748 condition codes. This saves a lot of redundant cond-code
1749 saving and restoring.</para>
1750 </listitem>
1751
1752</itemizedlist>
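
<para>To make the first transformation concrete, here is a toy
version of redundant <computeroutput>GET</computeroutput>
elimination.  It is a sketch, not the code in
<computeroutput>vg_improve()</computeroutput>: the
<computeroutput>ToyUInstr</computeroutput> layout and function
names are invented, subregisters are ignored altogether, and a
deleted <computeroutput>GET</computeroutput>'s
<computeroutput>TempReg</computeroutput> is renamed in all later
uinstrs rather than over a computed live range.</para>

<programlisting><![CDATA[
#include <stddef.h>

enum { N_ARCHREGS = 8, NO_TEMP = -1 };

typedef enum { U_GET, U_PUT, U_OTHER, U_NOP } ToyOp;

/* Hypothetical uinstr: GET copies ArchReg 'arch' into TempReg 'wr';
   PUT copies TempReg 'rd1' into 'arch'; U_OTHER reads rd1/rd2 and
   writes wr (NO_TEMP where a slot is unused). */
typedef struct {
    ToyOp op;
    int   arch;
    int   rd1, rd2, wr;
} ToyUInstr;

static void rename_temp(ToyUInstr *code, size_t from, size_t n,
                        int old_t, int new_t)
{
    for (size_t i = from; i < n; i++) {
        if (code[i].rd1 == old_t) code[i].rd1 = new_t;
        if (code[i].rd2 == old_t) code[i].rd2 = new_t;
        if (code[i].wr  == old_t) code[i].wr  = new_t;
    }
}

/* Returns the number of GETs removed (turned into U_NOP). */
size_t eliminate_redundant_gets(ToyUInstr *code, size_t n)
{
    int cached_in[N_ARCHREGS];   /* TempReg holding each ArchReg */
    size_t removed = 0;

    for (int a = 0; a < N_ARCHREGS; a++) cached_in[a] = NO_TEMP;

    for (size_t i = 0; i < n; i++) {
        ToyUInstr *u = &code[i];
        if (u->op == U_NOP) continue;

        if (u->op == U_GET && cached_in[u->arch] != NO_TEMP) {
            /* Value is already in a TempReg: reuse it. */
            rename_temp(code, i + 1, n, u->wr, cached_in[u->arch]);
            u->op = U_NOP;
            removed++;
            continue;
        }
        /* A write to a TempReg invalidates ArchRegs cached in it. */
        if (u->wr != NO_TEMP) {
            for (int a = 0; a < N_ARCHREGS; a++) {
                if (cached_in[a] == u->wr) cached_in[a] = NO_TEMP;
            }
        }
        if (u->op == U_GET) cached_in[u->arch] = u->wr;
        if (u->op == U_PUT) cached_in[u->arch] = u->rd1;
    }
    return removed;
}
]]></programlisting>

<para>Fed the ten uinstrs from the first three x86 instructions
of the running example, this removes the
<computeroutput>GET</computeroutput>s at 3 and 7 and renames
<computeroutput>t2</computeroutput> and
<computeroutput>t6</computeroutput> to
<computeroutput>t0</computeroutput>.</para>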
1753
1754<para>The effect of these transformations on our short block is
1755rather unexciting, and shown below. On longer basic blocks they
1756can dramatically improve code quality.</para>
1757
<programlisting><![CDATA[
at 3: delete GET, rename t2 to t0 in (4 .. 6)
at 7: delete GET, rename t6 to t0 in (8 .. 9)
at 1: annul flag write OSZAP due to later OSZACP

Improved code:
     0: GETL      %EDX, t0
     1: INCL      t0
     2: PUTL      t0, %EDX
     4: LDB       (t0), t0
     5: WIDENL_Bs t0
     6: PUTL      t0, %EAX
     8: GETL      %ECX, t8
     9: LEA2L     1(t8,t0,2), t4
    10: LDB       (t4), t10
    11: MOVB      $0x20, t12
    12: ANDB      t12, t10  (-wOSZACP)
    13: INCEIPo   $9
    14: Jnzo      $0x40435A50  (-rOSZACP)
    15: JMPo      $0x40435A5B]]></programlisting>
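
<para>The condition-code transformation is small enough to sketch
in full.  The struct and function below are hypothetical
stand-ins, with the flag sets held as bitmasks; a flag write with
no later flag-touching uinstr in the block is left alone, which is
an assumption made here to stay on the safe side.</para>

<programlisting><![CDATA[
#include <stddef.h>

enum { FLAG_O = 1, FLAG_S = 2, FLAG_Z = 4,
       FLAG_A = 8, FLAG_C = 16, FLAG_P = 32 };

/* Hypothetical uinstr carrying only its flag annotations. */
typedef struct {
    unsigned flags_r;   /* condition codes read */
    unsigned flags_w;   /* condition codes written */
} ToyUInstr;

/* Returns the number of uinstrs whose flag write was annulled. */
size_t annul_dead_flag_writes(ToyUInstr *code, size_t n)
{
    size_t annulled = 0;
    for (size_t i = 0; i < n; i++) {
        if (code[i].flags_w == 0) continue;
        for (size_t j = i + 1; j < n; j++) {
            /* Skip uinstrs which do not care about the flags. */
            if (code[j].flags_r == 0 && code[j].flags_w == 0) continue;
            /* The next one that cares: does it overwrite everything
               we wrote, without reading anything first? */
            if (code[j].flags_r == 0 &&
                (code[i].flags_w & ~code[j].flags_w) == 0) {
                code[i].flags_w = 0;
                annulled++;
            }
            break;
        }
    }
    return annulled;
}
]]></programlisting>

<para>In the running example <computeroutput>INCL</computeroutput>
writes OSZAP, and the next uinstr to touch the flags is
<computeroutput>ANDB</computeroutput>, which writes OSZACP and
reads none, so the <computeroutput>INCL</computeroutput> write is
annulled.  The <computeroutput>ANDB</computeroutput> write
survives because <computeroutput>Jnzo</computeroutput> reads
it.</para>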
1778
1779</sect2>
1780
1781
1782
1783<sect2 id="mc-tech-docs.instrum" xreflabel="UCode instrumentation">
1784<title>UCode instrumentation</title>
1785
1786<para>Once you understand the meaning of the instrumentation
1787uinstrs, discussed in detail above, the instrumentation scheme is
1788fairly straightforward. Each uinstr is instrumented in
1789isolation, and the instrumentation uinstrs are placed before the
1790original uinstr. Our running example continues below. I have
1791placed a blank line after every original ucode, to make it easier
1792to see which instrumentation uinstrs correspond to which
1793originals.</para>
1794
1795<para>As mentioned somewhere above,
1796<computeroutput>TempReg</computeroutput>s carrying values have
1797names like <computeroutput>t28</computeroutput>, and each one has
1798a shadow carrying its V bits, with names like
1799<computeroutput>q28</computeroutput>. This pairing aids in
1800reading instrumented ucode.</para>
1801
1802<para>One decision about all this is where to have "observation
1803points", that is, where to check that V bits are valid. I use a
1804minimalistic scheme, only checking where a failure of validity
1805could cause the original program to (seg)fault. So the use of
1806values as memory addresses causes a check, as do conditional
1807jumps (these cause a check on the definedness of the condition
1808codes). And arguments <computeroutput>PUSH</computeroutput>ed
1809for helper calls are checked, hence the weird restrictions on
helper call preambles described above.</para>
1811
1812<para>Another decision is that once a value is tested, it is
1813thereafter regarded as defined, so that we do not emit multiple
1814undefined-value errors for the same undefined value. That means
1815that <computeroutput>TESTV</computeroutput> uinstrs are always
1816followed by <computeroutput>SETV</computeroutput> on the same
1817(shadow) <computeroutput>TempReg</computeroutput>s. Most of
1818these <computeroutput>SETV</computeroutput>s are redundant and
1819are removed by the post-instrumentation cleanup phase.</para>
1820
1821<para>The instrumentation for calling helper functions deserves
1822further comment. The definedness of results from a helper is
1823modelled using just one V bit. So, in short, we do pessimising
1824casts of the definedness of all the args, down to a single bit,
1825and then <computeroutput>UifU</computeroutput> these bits
1826together. So this single V bit will say "undefined" if any part
1827of any arg is undefined. This V bit is then pessimally cast back
1828up to the result(s) sizes, as needed. If, by seeing that all the
1829args are got rid of with <computeroutput>CLEAR</computeroutput>
1830and none with <computeroutput>POP</computeroutput>, Valgrind sees
1831that the result of the call is not actually used, it immediately
1832examines the result V bit with a
1833<computeroutput>TESTV</computeroutput> --
1834<computeroutput>SETV</computeroutput> pair. If it did not do
this, there would be no observation point to detect that some
1836of the args to the helper were undefined. Of course, if the
1837helper's results are indeed used, we don't do this, since the
1838result usage will presumably cause the result definedness to be
1839checked at some suitable future point.</para>
1840
1841<para>In general Valgrind tries to track definedness on a
1842bit-for-bit basis, but as the above para shows, for calls to
1843helpers we throw in the towel and approximate down to a single
1844bit. This is because it's too complex and difficult to track
1845bit-level definedness through complex ops such as integer
multiply and divide, and in any case there are no reasonable code
1847fragments which attempt to (eg) multiply two partially-defined
1848values and end up with something meaningful, so there seems
1849little point in modelling multiplies, divides, etc, in that level
1850of detail.</para>
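
<para>In C terms, the single-bit approximation for helper calls
amounts to something like the following.  This is only a sketch:
the function name and the representation of the args as an array
of 32-bit V-bit vectors are assumptions of the example, not how
the instrumenter is actually written.</para>

<programlisting><![CDATA[
#include <stddef.h>
#include <stdint.h>

/* Collapse the V bits of every argument to one bit (1 = undefined),
   UifU (OR) those bits together, then pessimally cast the single
   bit back up to a result of 'result_bits' bits. */
uint32_t helper_result_vbits(const uint32_t *arg_vbits, size_t n_args,
                             unsigned result_bits)
{
    unsigned any_undefined = 0;

    for (size_t i = 0; i < n_args; i++)
        any_undefined |= (arg_vbits[i] != 0);

    if (!any_undefined) return 0;
    return result_bits >= 32 ? 0xFFFFFFFFu : ((1u << result_bits) - 1u);
}
]]></programlisting>

<para>One undefined bit anywhere in any arg is therefore enough
to make the whole result undefined.</para>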
1851
1852<para>Integer loads and stores are instrumented with firstly a
1853test of the definedness of the address, followed by a
1854<computeroutput>LOADV</computeroutput> or
1855<computeroutput>STOREV</computeroutput> respectively. These turn
1856into calls to (for example)
1857<computeroutput>VG_(helperc_LOADV4)</computeroutput>. These
1858helpers do two things: they perform an address-valid check, and
1859they load or store V bits from/to the relevant address in the
1860(simulated V-bit) memory.</para>
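
<para>The two jobs of these helpers can be shown on a toy shadow
memory.  Everything in the sketch below is hypothetical: the
64-byte address space, the flat arrays and the function names
have nothing to do with Valgrind's real shadow-memory layout, and
returning "defined" after a failed address check is a choice made
for this example.</para>

<programlisting><![CDATA[
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_MEM_SIZE 64u

/* One A (addressability) bit and one byte of V bits per client byte. */
static bool     a_valid[TOY_MEM_SIZE];
static uint8_t  v_bits[TOY_MEM_SIZE];
static unsigned address_errors;

void toy_shadow_reset(void)
{
    memset(a_valid, 0, sizeof a_valid);     /* nothing addressable */
    memset(v_bits, 0xFF, sizeof v_bits);    /* everything undefined */
    address_errors = 0;
}

void toy_make_addressable(uint32_t addr, uint32_t len)
{
    for (uint32_t i = 0; i < len && addr + i < TOY_MEM_SIZE; i++)
        a_valid[addr + i] = true;
}

static bool range_ok(uint32_t addr, uint32_t len)
{
    if (addr > TOY_MEM_SIZE || len > TOY_MEM_SIZE - addr) return false;
    for (uint32_t i = 0; i < len; i++)
        if (!a_valid[addr + i]) return false;
    return true;
}

/* 4-byte LOADV: address check, then fetch the V bits (little-endian). */
uint32_t toy_loadv4(uint32_t addr)
{
    if (!range_ok(addr, 4)) {
        address_errors++;
        return 0;   /* assumption: report once, then treat as defined */
    }
    return (uint32_t)v_bits[addr]
         | (uint32_t)v_bits[addr + 1] << 8
         | (uint32_t)v_bits[addr + 2] << 16
         | (uint32_t)v_bits[addr + 3] << 24;
}

/* 4-byte STOREV: address check, then write the V bits. */
void toy_storev4(uint32_t addr, uint32_t vbits)
{
    if (!range_ok(addr, 4)) { address_errors++; return; }
    for (uint32_t i = 0; i < 4; i++)
        v_bits[addr + i] = (uint8_t)(vbits >> (8 * i));
}

unsigned toy_address_errors(void) { return address_errors; }
]]></programlisting>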
1861
1862<para>FPU loads and stores are different. As above the
1863definedness of the address is first tested. However, the helper
1864routine for FPU loads
1865(<computeroutput>VGM_(fpu_read_check)</computeroutput>) emits an
1866error if either the address is invalid or the referenced area
1867contains undefined values. It has to do this because we do not
1868simulate the FPU at all, and so cannot track definedness of
1869values loaded into it from memory, so we have to check them as
1870soon as they are loaded into the FPU, ie, at this point. We
1871notionally assume that everything in the FPU is defined.</para>
1872
1873<para>It follows therefore that FPU writes first check the
1874definedness of the address, then the validity of the address, and
1875finally mark the written bytes as well-defined.</para>
1876
1877<para>If anyone is inspired to extend Valgrind to MMX/SSE insns,
1878I suggest you use the same trick. It works provided that the
FPU/MMX unit is not used merely as a conduit to copy partially
1880undefined data from one place in memory to another.
1881Unfortunately the integer CPU is used like that (when copying C
1882structs with holes, for example) and this is the cause of much of
1883the elaborateness of the instrumentation here described.</para>
1884
1885<para><computeroutput>vg_instrument()</computeroutput> in
1886<filename>vg_translate.c</filename> actually does the
1887instrumentation. There are comments explaining how each uinstr
1888is handled, so we do not repeat that here. As explained already,
1889it is bit-accurate, except for calls to helper functions.
1890Unfortunately the x86 insns
1891<computeroutput>bt/bts/btc/btr</computeroutput> are done by
1892helper fns, so bit-level accuracy is lost there. This should be
1893fixed by doing them inline; it will probably require adding a
couple of new uinstrs.  Also, left and right rotates through the
1895carry flag (x86 <computeroutput>rcl</computeroutput> and
1896<computeroutput>rcr</computeroutput>) are approximated via a
1897single V bit; so far this has not caused anyone to complain. The
1898non-carry rotates, <computeroutput>rol</computeroutput> and
1899<computeroutput>ror</computeroutput>, are much more common and
1900are done exactly. Re-visiting the instrumentation for AND and
1901OR, they seem rather verbose, and I wonder if it could be done
1902more concisely now.</para>
1903
1904<para>The lowercase <computeroutput>o</computeroutput> on many of
1905the uopcodes in the running example indicates that the size field
1906is zero, usually meaning a single-bit operation.</para>
1907
1908<para>Anyroads, the post-instrumented version of our running
1909example looks like this:</para>
1910
<programlisting><![CDATA[
Instrumented code:
     0: GETVL     %EDX, q0
     1: GETL      %EDX, t0

     2: TAG1o     q0 = Left4 ( q0 )
     3: INCL      t0

     4: PUTVL     q0, %EDX
     5: PUTL      t0, %EDX

     6: TESTVL    q0
     7: SETVL     q0
     8: LOADVB    (t0), q0
     9: LDB       (t0), t0

    10: TAG1o     q0 = SWiden14 ( q0 )
    11: WIDENL_Bs t0

    12: PUTVL     q0, %EAX
    13: PUTL      t0, %EAX

    14: GETVL     %ECX, q8
    15: GETL      %ECX, t8

    16: MOVL      q0, q4
    17: SHLL      $0x1, q4
    18: TAG2o     q4 = UifU4 ( q8, q4 )
    19: TAG1o     q4 = Left4 ( q4 )
    20: LEA2L     1(t8,t0,2), t4

    21: TESTVL    q4
    22: SETVL     q4
    23: LOADVB    (t4), q10
    24: LDB       (t4), t10

    25: SETVB     q12
    26: MOVB      $0x20, t12

    27: MOVL      q10, q14
    28: TAG2o     q14 = ImproveAND1_TQ ( t10, q14 )
    29: TAG2o     q10 = UifU1 ( q12, q10 )
    30: TAG2o     q10 = DifD1 ( q14, q10 )
    31: MOVL      q12, q14
    32: TAG2o     q14 = ImproveAND1_TQ ( t12, q14 )
    33: TAG2o     q10 = DifD1 ( q14, q10 )
    34: MOVL      q10, q16
    35: TAG1o     q16 = PCast10 ( q16 )
    36: PUTVFo    q16
    37: ANDB      t12, t10  (-wOSZACP)

    38: INCEIPo   $9

    39: GETVFo    q18
    40: TESTVo    q18
    41: SETVo     q18
    42: Jnzo      $0x40435A50  (-rOSZACP)

    43: JMPo      $0x40435A5B]]></programlisting>
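
<para>Uinstrs 27 to 35 are the verbose part, so here is the same
computation written out as a C sketch for 8-bit operands.  The
function names are invented for the illustration; the arithmetic
follows the tag op descriptions given earlier (1 means
undefined).</para>

<programlisting><![CDATA[
#include <stdint.h>

/* ImproveAND: a result bit is defined (0) only where the value bit
   is 0 and that value bit is itself defined. */
static uint8_t improve_and1(uint8_t value, uint8_t vbits)
{
    return (uint8_t)(value | vbits);
}

/* V bits of (t1 AND t2): UifU of the two definedness vectors,
   then DifD in one improvement term per argument. */
uint8_t and_result_vbits(uint8_t t1, uint8_t q1, uint8_t t2, uint8_t q2)
{
    uint8_t q = (uint8_t)(q1 | q2);             /* UifU1 */
    q = (uint8_t)(q & improve_and1(t1, q1));    /* DifD1, improvement 1 */
    q = (uint8_t)(q & improve_and1(t2, q2));    /* DifD1, improvement 2 */
    return q;
}

/* PCast down to one bit, as done before PUTVF: 1 if any result bit
   (and hence the flags) is undefined. */
int flags_undefined(uint8_t result_vbits)
{
    return result_vbits != 0;
}
]]></programlisting>

<para>For <computeroutput>testb $0x20</computeroutput> the
literal is fully defined, so only bit 5 of the loaded byte can
matter.  If every bit of that byte except bit 5 is undefined
(V bits <computeroutput>0xDF</computeroutput>), the result is
still fully defined; if bit 5 alone is undefined (V bits
<computeroutput>0x20</computeroutput>), the flags are
undefined.</para>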
1970
1971</sect2>
1972
1973
1974
1975<sect2 id="mc-tech-docs.cleanup"
1976 xreflabel="UCode post-instrumentation cleanup">
1977<title>UCode post-instrumentation cleanup</title>
1978
1979<para>This pass, coordinated by
1980<computeroutput>vg_cleanup()</computeroutput>, removes redundant
1981definedness computation created by the simplistic instrumentation
1982pass. It consists of two passes,
1983<computeroutput>vg_propagate_definedness()</computeroutput>
1984followed by
1985<computeroutput>vg_delete_redundant_SETVs</computeroutput>.</para>
1986
1987<para><computeroutput>vg_propagate_definedness()</computeroutput>
1988is a simple constant-propagation and constant-folding pass. It
1989tries to determine which
1990<computeroutput>TempReg</computeroutput>s containing V bits will
1991always indicate "fully defined", and it propagates this
1992information as far as it can, and folds out as many operations as
1993possible. For example, the instrumentation for an ADD of a
1994literal to a variable quantity will be reduced down so that the
1995definedness of the result is simply the definedness of the
1996variable quantity, since the literal is by definition fully
1997defined.</para>
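
<para>The folding rules themselves are tiny.  The sketch below
lists a few of them as a C function; the enum names are made up,
and the real pass handles more ops than this.  The rule for
<computeroutput>ImproveAND</computeroutput> follows from its
description: with fully defined V bits the improvement term is
just the value itself.</para>

<programlisting><![CDATA[
#include <stdbool.h>

typedef enum {
    TAG_PCAST, TAG_LEFT, TAG_UIFU, TAG_DIFD, TAG_IMPROVE_AND_TQ
} TagOp;

typedef enum {
    FOLD_NONE,              /* nothing known: keep the uinstr */
    FOLD_RESULT_DEFINED,    /* result is known to be fully defined */
    FOLD_COPY_OTHER_VBITS,  /* result equals the other V-bit argument */
    FOLD_COPY_VALUE         /* result equals the value (T) argument */
} Fold;

/* 'argN_defined' says whether argument N is known to be fully
   defined.  Unary ops ignore arg2_defined. */
Fold fold_tag_op(TagOp op, bool arg1_defined, bool arg2_defined)
{
    switch (op) {
    case TAG_PCAST:
    case TAG_LEFT:
        return arg1_defined ? FOLD_RESULT_DEFINED : FOLD_NONE;
    case TAG_UIFU:
        if (arg1_defined && arg2_defined) return FOLD_RESULT_DEFINED;
        if (arg1_defined || arg2_defined) return FOLD_COPY_OTHER_VBITS;
        return FOLD_NONE;
    case TAG_DIFD:
        return (arg1_defined || arg2_defined) ? FOLD_RESULT_DEFINED
                                              : FOLD_NONE;
    case TAG_IMPROVE_AND_TQ:
        /* arg1 is the value (T), arg2 its V bits (Q): T | 0 is T. */
        return arg2_defined ? FOLD_COPY_VALUE : FOLD_NONE;
    }
    return FOLD_NONE;
}
]]></programlisting>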
1998
1999<para><computeroutput>vg_delete_redundant_SETVs</computeroutput>
2000removes <computeroutput>SETV</computeroutput>s on shadow
2001<computeroutput>TempReg</computeroutput>s for which the next
2002action is a write. I don't think there's anything else worth
2003saying about this; it is simple. Read the sources for
2004details.</para>
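
<para>For the impatient, a toy version of the idea follows.  The
types and names are invented for the sketch.  It also deletes a
<computeroutput>SETV</computeroutput> whose shadow
<computeroutput>TempReg</computeroutput> is never touched again,
on the grounds that <computeroutput>TempReg</computeroutput>s are
dead at the end of the block; that extension is an inference on
my part, so check it against the sources.</para>

<programlisting><![CDATA[
#include <stdbool.h>
#include <stddef.h>

enum { NO_TEMP = -1 };

typedef enum { U_SETV, U_OTHER, U_NOP } ToyOp;

/* Hypothetical uinstr, tracking shadow TempRegs only. */
typedef struct {
    ToyOp op;
    int   rd1, rd2;   /* shadow TempRegs read, or NO_TEMP */
    int   wr;         /* shadow TempReg written, or NO_TEMP */
} ToyUInstr;

/* Returns the number of SETVs deleted (turned into U_NOP). */
size_t delete_redundant_setvs(ToyUInstr *code, size_t n)
{
    size_t deleted = 0;
    for (size_t i = 0; i < n; i++) {
        if (code[i].op != U_SETV) continue;
        int  q = code[i].wr;
        bool read_first = false;
        for (size_t j = i + 1; j < n; j++) {
            if (code[j].op == U_NOP) continue;
            if (code[j].rd1 == q || code[j].rd2 == q) {
                read_first = true;      /* the SETV value is needed */
                break;
            }
            if (code[j].wr == q) break; /* overwritten before any read */
        }
        if (!read_first) {
            code[i].op = U_NOP;
            deleted++;
        }
    }
    return deleted;
}
]]></programlisting>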
2005
2006<para>So the cleaned-up running example looks like this. As
2007above, I have inserted line breaks after every original
2008(non-instrumentation) uinstr to aid readability. As with
2009straightforward ucode optimisation, the results in this block are
2010undramatic because it is so short; longer blocks benefit more
2011because they have more redundancy which gets eliminated.</para>
2012
<programlisting><![CDATA[
at 29: delete UifU1 due to defd arg1
at 32: change ImproveAND1_TQ to MOV due to defd arg2
at 41: delete SETV
at 31: delete MOV
at 25: delete SETV
at 22: delete SETV
at 7: delete SETV

     0: GETVL     %EDX, q0
     1: GETL      %EDX, t0

     2: TAG1o     q0 = Left4 ( q0 )
     3: INCL      t0

     4: PUTVL     q0, %EDX
     5: PUTL      t0, %EDX

     6: TESTVL    q0
     8: LOADVB    (t0), q0
     9: LDB       (t0), t0

    10: TAG1o     q0 = SWiden14 ( q0 )
    11: WIDENL_Bs t0

    12: PUTVL     q0, %EAX
    13: PUTL      t0, %EAX

    14: GETVL     %ECX, q8
    15: GETL      %ECX, t8

    16: MOVL      q0, q4
    17: SHLL      $0x1, q4
    18: TAG2o     q4 = UifU4 ( q8, q4 )
    19: TAG1o     q4 = Left4 ( q4 )
    20: LEA2L     1(t8,t0,2), t4

    21: TESTVL    q4
    23: LOADVB    (t4), q10
    24: LDB       (t4), t10

    26: MOVB      $0x20, t12

    27: MOVL      q10, q14
    28: TAG2o     q14 = ImproveAND1_TQ ( t10, q14 )
    30: TAG2o     q10 = DifD1 ( q14, q10 )
    32: MOVL      t12, q14
    33: TAG2o     q10 = DifD1 ( q14, q10 )
    34: MOVL      q10, q16
    35: TAG1o     q16 = PCast10 ( q16 )
    36: PUTVFo    q16
    37: ANDB      t12, t10  (-wOSZACP)

    38: INCEIPo   $9
    39: GETVFo    q18
    40: TESTVo    q18
    42: Jnzo      $0x40435A50  (-rOSZACP)

    43: JMPo      $0x40435A5B]]></programlisting>
2072
2073</sect2>
2074
2075
2076
2077<sect2 id="mc-tech-docs.transfrom" xreflabel="Translation from UCode">
2078<title>Translation from UCode</title>
2079
2080<para>This is all very simple, even though
2081<filename>vg_from_ucode.c</filename> is a big file.
2082Position-independent x86 code is generated into a dynamically
2083allocated array <computeroutput>emitted_code</computeroutput>;
2084this is doubled in size when it overflows. Eventually the array
2085is handed back to the caller of
2086<computeroutput>VG_(translate)</computeroutput>, who must copy
2087the result into TC and TT, and free the array.</para>
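
<para>The doubling scheme is the usual one.  A minimal sketch,
assuming the C library's <computeroutput>realloc</computeroutput>
for brevity (Valgrind uses its own allocator, and the
<computeroutput>CodeBuf</computeroutput> type and function name
here are made up):</para>

<programlisting><![CDATA[
#include <stdlib.h>

/* Hypothetical stand-in for the emitted_code array and its
   bookkeeping. */
typedef struct {
    unsigned char *bytes;
    size_t used;
    size_t size;
} CodeBuf;

/* Append one byte, doubling the array when it is full.
   Returns 0 on success, -1 if memory could not be obtained. */
int codebuf_emit_byte(CodeBuf *cb, unsigned char b)
{
    if (cb->used == cb->size) {
        size_t new_size = cb->size ? cb->size * 2 : 16;
        unsigned char *p = realloc(cb->bytes, new_size);
        if (p == NULL) return -1;
        cb->bytes = p;
        cb->size  = new_size;
    }
    cb->bytes[cb->used++] = b;
    return 0;
}
]]></programlisting>

<para>Doubling keeps the total copying cost proportional to the
final size of the translation, however many bytes are emitted one
at a time.</para>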
2088
2089<para>This file is structured into four layers of abstraction,
2090which, thankfully, are glued back together with extensive
2091<computeroutput>__inline__</computeroutput> directives. From the
2092bottom upwards:</para>
2093
2094<itemizedlist>
2095
2096 <listitem>
2097 <para>Address-mode emitters,
2098 <computeroutput>emit_amode_regmem_reg</computeroutput> et
2099 al.</para>
2100 </listitem>
2101
2102 <listitem>
2103 <para>Emitters for specific x86 instructions. There are
2104 quite a lot of these, with names such as
2105 <computeroutput>emit_movv_offregmem_reg</computeroutput>.
2106 The <computeroutput>v</computeroutput> suffix is Intel
2107 parlance for a 16/32 bit insn; there are also
2108 <computeroutput>b</computeroutput> suffixes for 8 bit
2109 insns.</para>
2110 </listitem>
2111
2112 <listitem>
2113 <para>The next level up are the
2114 <computeroutput>synth_*</computeroutput> functions, which
2115 synthesise possibly a sequence of raw x86 instructions to do
2116 some simple task. Some of these are quite complex because
2117 they have to work around Intel's silly restrictions on
2118 subregister naming. See
2119 <computeroutput>synth_nonshiftop_reg_reg</computeroutput> for
2120 example.</para>
2121 </listitem>
2122
2123 <listitem>
2124 <para>Finally, at the top of the heap, we have
2125 <computeroutput>emitUInstr()</computeroutput>, which emits
2126 code for a single uinstr.</para>
2127 </listitem>
2128
2129</itemizedlist>
2130
2131<para>Some comments:</para>
2132
2133<itemizedlist>
2134
2135 <listitem>
2136 <para>The hack for FPU instructions becomes apparent here.
2137 To do a <computeroutput>FPU</computeroutput> ucode
    instruction, we load the simulated FPU's state from its
2139 <computeroutput>VG_(baseBlock)</computeroutput> into the real
2140 FPU using an x86 <computeroutput>frstor</computeroutput>
2141 insn, do the ucode <computeroutput>FPU</computeroutput> insn
2142 on the real CPU, and write the updated FPU state back into
2143 <computeroutput>VG_(baseBlock)</computeroutput> using an
2144 <computeroutput>fnsave</computeroutput> instruction. This is
2145 pretty brutal, but is simple and it works, and even seems
2146 tolerably efficient. There is no attempt to cache the
2147 simulated FPU state in the real FPU over multiple
2148 back-to-back ucode FPU instructions.</para>
2149
2150 <para><computeroutput>FPU_R</computeroutput> and
2151 <computeroutput>FPU_W</computeroutput> are also done this
2152 way, with the minor complication that we need to patch in
2153 some addressing mode bits so the resulting insn knows the
2154 effective address to use. This is easy because of the
2155 regularity of the x86 FPU instruction encodings.</para>
2156 </listitem>
2157
2158 <listitem>
2159 <para>An analogous trick is done with ucode insns which
2160 claim, in their <computeroutput>flags_r</computeroutput> and
2161 <computeroutput>flags_w</computeroutput> fields, that they
2162 read or write the simulated
2163 <computeroutput>%EFLAGS</computeroutput>. For such cases we
2164 first copy the simulated
2165 <computeroutput>%EFLAGS</computeroutput> into the real
2166 <computeroutput>%eflags</computeroutput>, then do the insn,
2167 then, if the insn says it writes the flags, copy back to
2168 <computeroutput>%EFLAGS</computeroutput>. This is a bit
2169 expensive, which is why the ucode optimisation pass goes to
2170 some effort to remove redundant flag-update annotations.</para>
2171 </listitem>
2172
2173</itemizedlist>
2174
2175<para>And so ... that's the end of the documentation for the
instrumenting translator!  It's really not that complex,
2177because it's composed as a sequence of simple(ish) self-contained
2178transformations on straight-line blocks of code.</para>
2179
2180</sect2>
2181
2182
2183
2184<sect2 id="mc-tech-docs.dispatch" xreflabel="Top-level dispatch loop">
2185<title>Top-level dispatch loop</title>
2186
2187<para>Urk. In <computeroutput>VG_(toploop)</computeroutput>.
2188This is basically boring and unsurprising, not to mention fiddly
2189and fragile. It needs to be cleaned up.</para>
2190
<para>Perhaps the only surprise is that the whole thing is run on
2192top of a <computeroutput>setjmp</computeroutput>-installed
2193exception handler, because, supposing a translation got a
2194segfault, we have to bail out of the Valgrind-supplied exception
2195handler <computeroutput>VG_(oursignalhandler)</computeroutput>
2196and immediately start running the client's segfault handler, if
2197it has one. In particular we can't finish the current basic
2198block and then deliver the signal at some convenient future
2199point, because signals like SIGILL, SIGSEGV and SIGBUS mean that
2200the faulting insn should not simply be re-tried. (I'm sure there
2201is a clearer way to explain this).</para>
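
<para>Perhaps code is the clearer way.  The sketch below shows
the general shape only: a handler which jumps straight back into
the loop that started the faulting code instead of returning to
it.  It uses the POSIX
<computeroutput>sigsetjmp</computeroutput>/<computeroutput>siglongjmp</computeroutput>
pair so that the signal mask is restored too; all the names are
invented, and this is not the code in
<computeroutput>VG_(toploop)</computeroutput>.</para>

<programlisting><![CDATA[
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>

static sigjmp_buf toploop_env;
static volatile sig_atomic_t fault_signo;

/* Stand-in for the Valgrind-supplied handler: bail out at once. */
static void on_fault(int signo)
{
    fault_signo = signo;
    siglongjmp(toploop_env, 1);   /* never resume the faulting insn */
}

/* Runs 'translation' under the handler.  Returns 0 if it ran to
   completion, or the number of the signal which interrupted it. */
int run_guarded(void (*translation)(void))
{
    struct sigaction sa, old;
    int result = 0;

    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGSEGV, &sa, &old);

    if (sigsetjmp(toploop_env, 1) == 0)
        translation();
    else
        result = fault_signo;     /* arrived here via siglongjmp */

    sigaction(SIGSEGV, &old, NULL);
    return result;
}

/* Two toy "translations" for trying it out. */
void clean_translation(void) { }
void faulting_translation(void) { raise(SIGSEGV); }
]]></programlisting>

<para>At the point where <computeroutput>run_guarded</computeroutput>
returns non-zero, the real thing would go on to start the
client's own handler for that signal.</para>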
2202
2203</sect2>
2204
2205
2206
2207<sect2 id="mc-tech-docs.lazy"
2208 xreflabel="Lazy updates of the simulated program counter">
2209<title>Lazy updates of the simulated program counter</title>
2210
2211<para>Simulated <computeroutput>%EIP</computeroutput> is not
2212updated after every simulated x86 insn as this was regarded as
2213too expensive. Instead ucode
2214<computeroutput>INCEIP</computeroutput> insns move it along as
2215and when necessary. Currently we don't allow it to fall more
2216than 4 bytes behind reality (see
2217<computeroutput>VG_(disBB)</computeroutput> for the way this
2218works).</para>
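
<para>One plausible way to decide where the
<computeroutput>INCEIP</computeroutput> insns go is sketched below.
It is an assumption made for illustration, not a description of
what <computeroutput>VG_(disBB)</computeroutput> actually does: the
function name <computeroutput>plan_inceips</computeroutput> and the
catch-up policy are invented, and only the 4-byte limit comes from
the text above.</para>
<programlisting><![CDATA[
#include <assert.h>
#include <stddef.h>

#define MAX_EIP_LAG 4   /* the 4-byte limit quoted above */

/* Given the byte lengths of the x86 insns in one basic block, write
   into inceip[i] the amount by which an INCEIP placed after insn i
   would advance the simulated %EIP (0 means no INCEIP is emitted).
   Returns how far the simulated %EIP still lags at the end of the
   block. */
static unsigned plan_inceips(const unsigned *insn_len, size_t n_insns,
                             unsigned *inceip)
{
    unsigned lag = 0;
    for (size_t i = 0; i < n_insns; i++) {
        lag += insn_len[i];
        if (lag > MAX_EIP_LAG) {
            inceip[i] = lag;   /* catch up in one step */
            lag = 0;
        } else {
            inceip[i] = 0;     /* stay lazy */
        }
    }
    assert(lag <= MAX_EIP_LAG);
    return lag;
}
]]></programlisting>
<para>For insn lengths 1, 2, 3, 1, 5, 2 this plans an
<computeroutput>INCEIP</computeroutput> of 6 after the third and
fifth insns and leaves a lag of 2 bytes at the end of the
block.</para>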
2219
2220<para>Note that <computeroutput>%EIP</computeroutput> is always
2221brought up to date by the inner dispatch loop in
2222<computeroutput>VG_(dispatch)</computeroutput>, so that if the
2223client takes a fault we know at least which basic block this
2224happened in.</para>
2225
2226</sect2>
2227
2228
2229
2230<sect2 id="mc-tech-docs.signals" xreflabel="Signals">
2231<title>Signals</title>
2232
2233<para>Horrible, horrible. <filename>vg_signals.c</filename>.
2234Basically, since we have to intercept all system calls anyway, we
2235can see when the client tries to install a signal handler. If it
2236does so, we make a note of what the client asked to happen, and
2237ask the kernel to route the signal to our own signal handler,
2238<computeroutput>VG_(oursignalhandler)</computeroutput>. This
2239simply notes the delivery of signals, and returns.</para>
2240
2241<para>Every 1000 basic blocks, we see if more signals have
2242arrived. If so,
2243<computeroutput>VG_(deliver_signals)</computeroutput> builds
2244signal delivery frames on the client's stack, and allows their
2245handlers to be run. Valgrind places in these signal delivery
2246frames a bogus return address,
2247<computeroutput>VG_(signalreturn_bogusRA)</computeroutput>, and
checks all jumps to see if any jump to it. If one does, a signal
handler is returning, so Valgrind removes the relevant signal
frame from the client's stack, restores from that frame the
simulated state as it was before the signal was delivered, and
allows the client to run onwards. We have to do
2253it this way because some signal handlers never return, they just
2254<computeroutput>longjmp()</computeroutput>, which nukes the
2255signal delivery frame.</para>
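
<para>The note-now, deliver-later arrangement can be sketched as
below. The names and the fixed-size
<computeroutput>pending</computeroutput> array are assumptions made
for the example; the real
<computeroutput>VG_(deliver_signals)</computeroutput> builds
delivery frames on the client's stack, which this sketch replaces
with a simple count.</para>
<programlisting><![CDATA[
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>

#define MAX_SIGNO       64    /* assumed upper bound on signal numbers */
#define POLL_INTERVAL 1000    /* basic blocks between checks */

static volatile sig_atomic_t pending[MAX_SIGNO + 1];
static unsigned blocks_until_poll = POLL_INTERVAL;

/* Stand-in for VG_(oursignalhandler): just note the delivery. */
static void note_signal(int signo)
{
    if (signo >= 1 && signo <= MAX_SIGNO)
        pending[signo] = 1;
}

/* Stand-in for VG_(deliver_signals): count and clear the noted
   signals.  The real thing would build a frame for each one. */
static int deliver_noted_signals(void)
{
    int delivered = 0;
    for (int s = 1; s <= MAX_SIGNO; s++) {
        if (pending[s]) {
            pending[s] = 0;
            delivered++;
        }
    }
    return delivered;
}

/* Called once per basic block by the dispatch loop.  Returns the
   number of signals handed to the client at this point (usually 0). */
static int after_basic_block(void)
{
    assert(blocks_until_poll > 0);
    if (--blocks_until_poll > 0)
        return 0;
    blocks_until_poll = POLL_INTERVAL;
    return deliver_noted_signals();
}
]]></programlisting>
<para>A signal that arrives just after a check therefore waits up
to 1000 basic blocks before the client sees it, which is the price
of keeping the handler itself trivial.</para>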
2256
2257<para>The Linux kernel has a different but equally horrible hack
2258for detecting signal handler returns. Discovering it is left as
2259an exercise for the reader.</para>
2260
2261</sect2>
2262
2263
2264<sect2 id="mc-tech-docs.todo">
2265<title>To be written</title>
2266
2267<para>The following is a list of as-yet-not-written stuff. Apologies.</para>
2268<orderedlist>
2269 <listitem>
2270 <para>The translation cache and translation table</para>
2271 </listitem>
2272 <listitem>
2273 <para>Exceptions, creating new translations</para>
2274 </listitem>
2275 <listitem>
2276 <para>Self-modifying code</para>
2277 </listitem>
2278 <listitem>
2279 <para>Errors, error contexts, error reporting, suppressions</para>
2280 </listitem>
2281 <listitem>
2282 <para>Client malloc/free</para>
2283 </listitem>
2284 <listitem>
2285 <para>Low-level memory management</para>
2286 </listitem>
2287 <listitem>
2288 <para>A and V bitmaps</para>
2289 </listitem>
2290 <listitem>
2291 <para>Symbol table management</para>
2292 </listitem>
2293 <listitem>
2294 <para>Dealing with system calls</para>
2295 </listitem>
2296 <listitem>
2297 <para>Namespace management</para>
2298 </listitem>
2299 <listitem>
2300 <para>GDB attaching</para>
2301 </listitem>
2302 <listitem>
2303 <para>Non-dependence on glibc or anything else</para>
2304 </listitem>
2305 <listitem>
2306 <para>The leak detector</para>
2307 </listitem>
2308 <listitem>
2309 <para>Performance problems</para>
2310 </listitem>
2311 <listitem>
2312 <para>Continuous sanity checking</para>
2313 </listitem>
2314 <listitem>
2315 <para>Tracing, or not tracing, child processes</para>
2316 </listitem>
2317 <listitem>
2318 <para>Assembly glue for syscalls</para>
2319 </listitem>
2320</orderedlist>
2321
2322</sect2>
2323
2324</sect1>
2325
2326
2327
2328
2329<sect1 id="mc-tech-docs.extensions" xreflabel="Extensions">
2330<title>Extensions</title>
2331
2332<para>Some comments about Stuff To Do.</para>
2333
2334<sect2 id="mc-tech-docs.bugs" xreflabel="Bugs">
2335<title>Bugs</title>
2336
2337<para>Stephan Kulow and Marc Mutz report problems with kmail in
2338KDE 3 CVS (RC2 ish) when run on Valgrind. Stephan has it
2339deadlocking; Marc has it looping at startup. I can't repro
2340either behaviour. Needs repro-ing and fixing.</para>
2341
2342</sect2>
2343
2344
2345<sect2 id="mc-tech-docs.threads" xreflabel="Threads">
2346<title>Threads</title>
2347
2348<para>Doing a good job of thread support strikes me as almost a
2349research-level problem. The central issues are how to do fast
2350cheap locking of the
2351<computeroutput>VG_(primary_map)</computeroutput> structure,
2352whether or not accesses to the individual secondary maps need
2353locking, what race-condition issues result, and whether the
2354already-nasty mess that is the signal simulator needs further
2355hackery.</para>
2356
2357<para>I realise that threads are the most-frequently-requested
2358feature, and I am thinking about it all. If you have guru-level
2359understanding of fast mutual exclusion mechanisms and race
2360conditions, I would be interested in hearing from you.</para>
2361
2362</sect2>
2363
2364
2365
2366<sect2 id="mc-tech-docs.verify" xreflabel="Verification suite">
2367<title>Verification suite</title>
2368
<para>Directory <computeroutput>tests/</computeroutput> contains
various ad-hoc tests for Valgrind. However, there is no
systematic verification or regression suite that, for example,
exercises all the stuff in <filename>vg_memory.c</filename> to
ensure that illegal memory accesses and undefined value uses are
detected as they should be. It would be good to have such a
suite.</para>
2376
2377</sect2>
2378
2379
2380<sect2 id="mc-tech-docs.porting" xreflabel="Porting to other platforms">
2381<title>Porting to other platforms</title>
2382
<para>It would be great if Valgrind were ported to FreeBSD and x86
NetBSD, and to x86 OpenBSD, if that is possible (doesn't OpenBSD
use a.out-style executables rather than ELF?)</para>
2386
2387<para>The main difficulties, for an x86-ELF platform, seem to
2388be:</para>
2389
2390<itemizedlist>
2391
2392 <listitem>
2393 <para>You'd need to rewrite the
2394 <computeroutput>/proc/self/maps</computeroutput> parser
2395 (<filename>vg_procselfmaps.c</filename>). Easy.</para>
2396 </listitem>
2397
2398 <listitem>
2399 <para>You'd need to rewrite
2400 <filename>vg_syscall_mem.c</filename>, or, more specifically,
2401 provide one for your OS. This is tedious, but you can
2402 implement syscalls on demand, and the Linux kernel interface
2403 is, for the most part, going to look very similar to the *BSD
2404 interfaces, so it's really a copy-paste-and-modify-on-demand
2405 job. As part of this, you'd need to supply a new
2406 <filename>vg_kerneliface.h</filename> file.</para>
2407 </listitem>
2408
2409 <listitem>
2410 <para>You'd also need to change the syscall wrappers for
2411 Valgrind's internal use, in
2412 <filename>vg_mylibc.c</filename>.</para>
2413 </listitem>
2414
2415</itemizedlist>
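
<para>To give a feel for the size of the first item, here is a
sketch of parsing one line in the Linux
<computeroutput>/proc/self/maps</computeroutput> format. It is not
the code in <filename>vg_procselfmaps.c</filename>: the
<computeroutput>MapEntry</computeroutput> type and
<computeroutput>parse_maps_line</computeroutput> are hypothetical,
and the sketch leans on <computeroutput>sscanf</computeroutput>
from the C library purely for brevity. A port would substitute
whatever format the target OS provides.</para>
<programlisting><![CDATA[
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed mapping.  Field names are illustrative only. */
typedef struct {
    unsigned long start, end;   /* the range [start, end) */
    char perms[5];              /* e.g. "r-xp" */
    unsigned long offset;       /* offset into the mapped file */
    char path[256];             /* empty for anonymous mappings */
} MapEntry;

/* Parse one line of the form
       start-end perms offset dev inode [path]
   Returns 1 on success, 0 if the line does not match.  Paths that
   contain spaces are cut at the first space. */
static int parse_maps_line(const char *line, MapEntry *out)
{
    int n;
    assert(line != NULL && out != NULL);
    memset(out, 0, sizeof *out);
    n = sscanf(line, "%lx-%lx %4s %lx %*s %*s %255s",
               &out->start, &out->end, out->perms,
               &out->offset, out->path);
    /* The path is optional, so four converted fields are enough. */
    return n >= 4 && out->start < out->end;
}
]]></programlisting>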
2416
<para>All in all, I think a port to x86-ELF *BSDs is not really
very difficult, and in some ways I would like to see it happen,
because that would force a clearer factoring of Valgrind into
platform-dependent and platform-independent pieces. Not to
mention, *BSD folks also deserve to use Valgrind just as much as
the Linux crew do.</para>
2423
2424</sect2>
2425
2426</sect1>
2427
2428
2429
2430<sect1 id="mc-tech-docs.easystuff"
2431 xreflabel="Easy stuff which ought to be done">
2432<title>Easy stuff which ought to be done</title>
2433
2434
2435<sect2 id="mc-tech-docs.mmx" xreflabel="MMX Instructions">
2436<title>MMX Instructions</title>
2437
2438<para>MMX insns should be supported, using the same trick as for
2439FPU insns. If the MMX registers are not used to copy
2440uninitialised junk from one place to another in memory, this
2441means we don't have to actually simulate the internal MMX unit
2442state, so the FPU hack applies. This should be fairly
2443easy.</para>
2444
2445</sect2>
2446
2447
2448<sect2 id="mc-tech-docs.fixstabs" xreflabel="Fix stabs-info Reader">
2449<title>Fix stabs-info reader</title>
2450
2451<para>The machinery in <filename>vg_symtab2.c</filename> which
2452reads "stabs" style debugging info is pretty weak. It usually
2453correctly translates simulated program counter values into line
2454numbers and procedure names, but the file name is often
2455completely wrong. I think the logic used to parse "stabs"
2456entries is weak. It should be fixed. The simplest solution,
2457IMO, is to copy either the logic or simply the code out of GNU
2458binutils which does this; since GDB can clearly get it right,
2459binutils (or GDB?) must have code to do this somewhere.</para>
2460
2461</sect2>
2462
2463
2464
2465<sect2 id="mc-tech-docs.x86instr" xreflabel="BT/BTC/BTS/BTR">
2466<title>BT/BTC/BTS/BTR</title>
2467
2468<para>These are x86 instructions which test, complement, set, or
2469reset, a single bit in a word. At the moment they are both
2470incorrectly implemented and incorrectly instrumented.</para>
2471
2472<para>The incorrect instrumentation is due to use of helper
2473functions. This means we lose bit-level definedness tracking,
2474which could wind up giving spurious uninitialised-value use
2475errors. The Right Thing to do is to invent a couple of new
2476UOpcodes, I think <computeroutput>GET_BIT</computeroutput> and
2477<computeroutput>SET_BIT</computeroutput>, which can be used to
2478implement all 4 x86 insns, get rid of the helpers, and give
2479bit-accurate instrumentation rules for the two new
2480UOpcodes.</para>
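
<para>To make the proposal concrete, here is a sketch of how the
four insns reduce to two bit primitives when the operand is a
32-bit register. The C functions
<computeroutput>get_bit</computeroutput> and
<computeroutput>set_bit</computeroutput> are assumed semantics for
the suggested UOpcodes, which do not exist yet, and
<computeroutput>do_bit_op</computeroutput> is a hypothetical
helper for the example.</para>
<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>

/* Assumed semantics for the two proposed UOpcodes, on a 32-bit word. */
static uint32_t get_bit(uint32_t word, unsigned idx)
{
    assert(idx < 32);
    return (word >> idx) & 1u;
}

static uint32_t set_bit(uint32_t word, unsigned idx, uint32_t bit)
{
    assert(idx < 32 && bit <= 1);
    return (word & ~(UINT32_C(1) << idx)) | (bit << idx);
}

typedef enum { OP_BT, OP_BTS, OP_BTR, OP_BTC } BitOp;

/* Apply one of the four insns to *word.  The value returned is what
   the insn leaves in the carry flag: the old value of the bit. */
static uint32_t do_bit_op(BitOp op, uint32_t *word, unsigned idx)
{
    uint32_t old = get_bit(*word, idx);
    switch (op) {
        case OP_BT:                                        break;
        case OP_BTS: *word = set_bit(*word, idx, 1);       break;
        case OP_BTR: *word = set_bit(*word, idx, 0);       break;
        case OP_BTC: *word = set_bit(*word, idx, old ^ 1); break;
    }
    return old;
}
]]></programlisting>
<para>Because each primitive touches exactly one bit, an
instrumentation rule for it only needs to consult or update the
definedness of that one bit, which is what the helper-function
approach loses.</para>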
2481
2482<para>I realised the other day that they are mis-implemented too.
2483The x86 insns take a bit-index and a register or memory location
2484to access. For registers the bit index clearly can only be in
2485the range zero to register-width minus 1, and I assumed the same
2486applied to memory locations too. But evidently not; for memory
2487locations the index can be arbitrary, and the processor will
2488index arbitrarily into memory as a result. This too should be
2489fixed. Sigh. Presumably indexing outside the immediate word is
2490not actually used by any programs yet tested on Valgrind, for
2491otherwise they (presumably) would simply not work at all. If you
2492plan to hack on this, first check the Intel docs to make sure my
2493understanding is really correct.</para>
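
<para>The addressing rule for the memory form can be sketched as
below, assuming (as the paragraph above does, and subject to the
same check against the Intel docs) that a register-supplied bit
index is treated as a signed offset from the operand's address.
The function <computeroutput>bt_mem</computeroutput> is
hypothetical and works a byte at a time, which on a little-endian
machine selects the same bit as word-sized indexing would.</para>
<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>

/* Return the bit that lies `bitoff` bits away from `base`, where
   bitoff may be negative or larger than the operand width. */
static uint32_t bt_mem(const uint8_t *base, int32_t bitoff)
{
    int32_t byte_off    = bitoff / 8;
    int32_t bit_in_byte = bitoff % 8;

    assert(base != 0);
    /* C division truncates towards zero; adjust to floor division
       so that negative indices select bytes below base. */
    if (bit_in_byte < 0) {
        bit_in_byte += 8;
        byte_off    -= 1;
    }
    return (uint32_t)(base[byte_off] >> bit_in_byte) & 1u;
}
]]></programlisting>
<para>With this rule, index 35 selects bit 3 of the byte at offset
4, and index -1 selects the top bit of the byte just below the
operand's address.</para>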
2494
2495</sect2>
2496
2497
2498<sect2 id="mc-tech-docs.prefetch" xreflabel="Using PREFETCH Instructions">
2499<title>Using PREFETCH Instructions</title>
2500
<para>Here's a small but potentially interesting project for
performance junkies. Experiments with Valgrind's code generator
and optimiser(s) suggest that reducing the number of instructions
executed in the translations and mem-check helpers gives
disappointingly small performance improvements. Perhaps this is
because performance of Valgrindified code is limited by cache
misses. After all, each read in the original program now gives
rise to at least three reads: one for the
<computeroutput>VG_(primary_map)</computeroutput>, one for the
resulting secondary map, and the original. Not to mention, the
instrumented translations are 13 to 14 times larger than the
originals. All in all one would expect the memory system to be
hammered to hell and then some.</para>
2514
2515<para>So here's an idea. An x86 insn involving a read from
2516memory, after instrumentation, will turn into ucode of the
2517following form:</para>
<programlisting><![CDATA[
... calculate effective addr, into ta and qa ...
    TESTVL qa             -- is the addr defined?
    LOADV (ta), qloaded   -- fetch V bits for the addr
    LOAD  (ta), tloaded   -- do the original load]]></programlisting>
2523
2524<para>At the point where the
2525<computeroutput>LOADV</computeroutput> is done, we know the
2526actual address (<computeroutput>ta</computeroutput>) from which
2527the real <computeroutput>LOAD</computeroutput> will be done. We
2528also know that the <computeroutput>LOADV</computeroutput> will
2529take around 20 x86 insns to do. So it seems plausible that doing
2530a prefetch of <computeroutput>ta</computeroutput> just before the
2531<computeroutput>LOADV</computeroutput> might just avoid a miss at
2532the <computeroutput>LOAD</computeroutput> point, and that might
2533be a significant performance win.</para>
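
<para>In C terms the idea looks roughly like this. The two-level
map layout, the type <computeroutput>Secondary</computeroutput>
and the functions <computeroutput>load_vbyte</computeroutput> and
<computeroutput>checked_load</computeroutput> are assumptions made
for the sketch, not Valgrind's actual data structures, and gcc's
<computeroutput>__builtin_prefetch</computeroutput> stands in for
whatever prefetch insn the code generator would emit.</para>
<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>

/* Assumed two-level V-bit map over a 32-bit address space: the top
   16 address bits pick a secondary map, the low 16 bits pick a byte
   within it.  The real VG_(primary_map) layout may differ. */
typedef struct { uint8_t vbyte[65536]; } Secondary;
static Secondary *primary_map[65536];

/* Stand-in for LOADV: one read of the primary, one of the secondary. */
static uint8_t load_vbyte(uint32_t ta)
{
    Secondary *sm = primary_map[ta >> 16];
    assert(sm != 0);
    return sm->vbyte[ta & 0xFFFF];
}

/* Instrumented one-byte load.  client_mem is the host address that
   corresponds to simulated address 0, a simplification for the
   sketch. */
static uint8_t checked_load(const uint8_t *client_mem, uint32_t ta,
                            uint8_t *vbits_out)
{
#if defined(__GNUC__)
    /* Start fetching the client's data before the V-bit lookup, so
       that it may already be in cache when the real load happens. */
    __builtin_prefetch(client_mem + ta, 0 /* read */, 3 /* keep */);
#endif
    *vbits_out = load_vbyte(ta);   /* LOADV */
    return client_mem[ta];         /* LOAD  */
}
]]></programlisting>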
2534
<para>Prefetch insns are notoriously temperamental, more often
2536than not making things worse rather than better, so this would
2537require considerable fiddling around. It's complicated because
2538Intels and AMDs have different prefetch insns with different
2539semantics, so that too needs to be taken into account. As a
2540general rule, even placing the prefetches before the
2541<computeroutput>LOADV</computeroutput> insn is too near the
2542<computeroutput>LOAD</computeroutput>; the ideal distance is
2543apparently circa 200 CPU cycles. So it might be worth having
2544another analysis/transformation pass which pushes prefetches as
2545far back as possible, hopefully immediately after the effective
2546address becomes available.</para>
2547
2548<para>Doing too many prefetches is also bad because they soak up
2549bus bandwidth / cpu resources, so some cleverness in deciding
2550which loads to prefetch and which to not might be helpful. One
2551can imagine not prefetching client-stack-relative
2552(<computeroutput>%EBP</computeroutput> or
2553<computeroutput>%ESP</computeroutput>) accesses, since the stack
2554in general tends to show good locality anyway.</para>
2555
2556<para>There's quite a lot of experimentation to do here, but I
2557think it might make an interesting week's work for
2558someone.</para>
2559
2560<para>As of 15-ish March 2002, I've started to experiment with
2561this, using the AMD
2562<computeroutput>prefetch/prefetchw</computeroutput> insns.</para>
2563
2564</sect2>
2565
2566
2567<sect2 id="mc-tech-docs.pranges" xreflabel="User-defined Permission Ranges">
2568<title>User-defined Permission Ranges</title>
2569
2570<para>This is quite a large project -- perhaps a month's hacking
2571for a capable hacker to do a good job -- but it's potentially
2572very interesting. The outcome would be that Valgrind could
2573detect a whole class of bugs which it currently cannot.</para>
2574
2575<para>The presentation falls into two pieces.</para>
2576
2577<sect3 id="mc-tech-docs.psetting"
2578 xreflabel="Part 1: User-defined Address-range Permission Setting">
2579<title>Part 1: User-defined Address-range Permission Setting</title>
2580
2581<para>Valgrind intercepts the client's
2582<computeroutput>malloc</computeroutput>,
2583<computeroutput>free</computeroutput>, etc calls, watches system
2584calls, and watches the stack pointer move. This is currently the
2585only way it knows about which addresses are valid and which not.
2586Sometimes the client program knows extra information about its
2587memory areas. For example, the client could at some point know
2588that all elements of an array are out-of-date. We would like to
2589be able to convey to Valgrind this information that the array is
2590now addressable-but-uninitialised, so that Valgrind can then warn
2591if elements are used before they get new values.</para>
2592
2593<para>What I would like are some macros like this:</para>
<programlisting><![CDATA[
  VALGRIND_MAKE_NOACCESS(addr, len)
  VALGRIND_MAKE_WRITABLE(addr, len)
  VALGRIND_MAKE_READABLE(addr, len)]]></programlisting>

<para>and also, to check that memory is
addressable/initialised,</para>
<programlisting><![CDATA[
  VALGRIND_CHECK_ADDRESSIBLE(addr, len)
  VALGRIND_CHECK_INITIALISED(addr, len)]]></programlisting>
2604
2605<para>I then include in my sources a header defining these
2606macros, rebuild my app, run under Valgrind, and get user-defined
2607checks.</para>
2608
2609<para>Now here's a neat trick. It's a nuisance to have to
2610re-link the app with some new library which implements the above
2611macros. So the idea is to define the macros so that the
2612resulting executable is still completely stand-alone, and can be
2613run without Valgrind, in which case the macros do nothing, but
2614when run on Valgrind, the Right Thing happens. How to do this?
2615The idea is for these macros to turn into a piece of inline
2616assembly code, which (1) has no effect when run on the real CPU,
2617(2) is easily spotted by Valgrind's JITter, and (3) no sane
2618person would ever write, which is important for avoiding false
2619matches in (2). So here's a suggestion:</para>
<programlisting><![CDATA[
  VALGRIND_MAKE_NOACCESS(addr, len)]]></programlisting>

<para>becomes (roughly speaking)</para>
<programlisting><![CDATA[
  movl addr, %eax
  movl len,  %ebx
  movl $1,   %ecx   -- 1 describes the action; MAKE_WRITABLE might be
                    -- 2, etc
  rorl $13, %ecx
  rorl $19, %ecx
  rorl $11, %eax
  rorl $21, %eax]]></programlisting>
2633
<para>The rotate sequences have no effect (each pair rotates a
32-bit register by a total of 32 bits, which restores its value),
and it's unlikely they would appear for any other reason, but they
define a unique byte-sequence which the JITter can easily spot.
Using the
2637operand constraints section at the end of a gcc inline-assembly
2638statement, we can tell gcc that the assembly fragment kills
2639<computeroutput>%eax</computeroutput>,
2640<computeroutput>%ebx</computeroutput>,
2641<computeroutput>%ecx</computeroutput> and the condition codes, so
2642this fragment is made harmless when not running on Valgrind, runs
2643quickly when not on Valgrind, and does not require any other
2644library support.</para>
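
<para>A gcc rendering of the macro might look like the sketch
below. It is an illustration under assumptions, not a finished
header: the action codes and the helper
<computeroutput>vg_magic_request</computeroutput> are invented
here, the registers are loaded through operand constraints rather
than explicit <computeroutput>movl</computeroutput> insns, and on
non-x86 compilers the helper simply does nothing.</para>
<programlisting><![CDATA[
#include <stdint.h>

/* Hypothetical action codes; only "1 means NOACCESS" is suggested
   by the text above. */
enum { VG_ACT_NOACCESS = 1, VG_ACT_WRITABLE = 2, VG_ACT_READABLE = 3 };

/* Emit the magic rotate sequence with addr in %eax, len in %ebx and
   the action code in %ecx.  Returns addr, which the rotates leave
   unchanged. */
static inline uint32_t vg_magic_request(uint32_t action, uint32_t addr,
                                        uint32_t len)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ volatile(
        "rorl $13, %%ecx\n\t"
        "rorl $19, %%ecx\n\t"
        "rorl $11, %%eax\n\t"
        "rorl $21, %%eax"
        : "+a"(addr), "+c"(action)   /* read and written by the asm */
        : "b"(len)                   /* read only */
        : "cc");                     /* rotates alter the flags */
#else
    (void)action;
    (void)len;
#endif
    return addr;
}

/* The address is truncated to 32 bits, matching the x86 target that
   the scheme is designed for. */
#define VALGRIND_MAKE_NOACCESS(addr, len)                         \
    ((void)vg_magic_request(VG_ACT_NOACCESS,                      \
                            (uint32_t)(uintptr_t)(addr),          \
                            (uint32_t)(len)))
]]></programlisting>
<para>Run on a real CPU, the macro costs four rotates and a few
register moves and changes nothing the program can observe.</para>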
2645
2646
2647</sect3>
2648
2649
2650<sect3 id="mc-tech-docs.prange-detect"
2651 xreflabel="Part 2: Using it to detect Interference between Stack
2652Variables">
2653<title>Part 2: Using it to detect Interference between Stack
2654Variables</title>
2655
2656<para>Currently Valgrind cannot detect errors of the following
2657form:</para>
<programlisting><![CDATA[
void fooble ( void )
{
   int a[10];
   int b[10];
   a[10] = 99;   /* out of bounds: a has elements 0 .. 9 */
}]]></programlisting>
2665
2666<para>Now imagine rewriting this as</para>
<programlisting><![CDATA[
void fooble ( void )
{
   int spacer0;
   int a[10];
   int spacer1;
   int b[10];
   int spacer2;
   VALGRIND_MAKE_NOACCESS(&spacer0, sizeof(int));
   VALGRIND_MAKE_NOACCESS(&spacer1, sizeof(int));
   VALGRIND_MAKE_NOACCESS(&spacer2, sizeof(int));
   a[10] = 99;
}]]></programlisting>
2680
<para>Provided the compiler lays out the locals contiguously in
declaration order, the invalid write is now certain to hit
<computeroutput>spacer0</computeroutput> or
<computeroutput>spacer1</computeroutput>, so Valgrind will spot
the error.</para>
2685
2686<para>There are two complications.</para>
2687
2688<orderedlist>
2689
2690 <listitem>
2691 <para>The first is that we don't want to annotate sources by
2692 hand, so the Right Thing to do is to write a C/C++ parser,
2693 annotator, prettyprinter which does this automatically, and
2694 run it on post-CPP'd C/C++ source. See
2695 http://www.cacheprof.org for an example of a system which
2696 transparently inserts another phase into the gcc/g++
2697 compilation route. The parser/prettyprinter is probably not
2698 as hard as it sounds; I would write it in Haskell, a powerful
2699 functional language well suited to doing symbolic
 computation, with which I am intimately familiar. There is
2701 already a C parser written in Haskell by someone in the
2702 Haskell community, and that would probably be a good starting
2703 point.</para>
2704 </listitem>
2705
2706
2707 <listitem>
2708 <para>The second complication is how to get rid of these
2709 <computeroutput>NOACCESS</computeroutput> records inside
2710 Valgrind when the instrumented function exits; after all,
2711 these refer to stack addresses and will make no sense
2712 whatever when some other function happens to re-use the same
2713 stack address range, probably shortly afterwards. I think I
2714 would be inclined to define a special stack-specific
2715 macro:</para>
<programlisting><![CDATA[
  VALGRIND_MAKE_NOACCESS_STACK(addr, len)]]></programlisting>
2718 <para>which causes Valgrind to record the client's
2719 <computeroutput>%ESP</computeroutput> at the time it is
2720 executed. Valgrind will then watch for changes in
2721 <computeroutput>%ESP</computeroutput> and discard such
2722 records as soon as the protected area is uncovered by an
2723 increase in <computeroutput>%ESP</computeroutput>. I
2724 hesitate with this scheme only because it is potentially
2725 expensive, if there are hundreds of such records, and
2726 considering that changes in
2727 <computeroutput>%ESP</computeroutput> already require
2728 expensive messing with stack access permissions.</para>
2729 </listitem>
2730</orderedlist>
2731
2732<para>This is probably easier and more robust than for the
2733instrumenter program to try and spot all exit points for the
2734procedure and place suitable deallocation annotations there.
2735Plus C++ procedures can bomb out at any point if they get an
2736exception, so spotting return points at the source level just
2737won't work at all.</para>
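
<para>The bookkeeping for the stack-specific macro could be as
simple as the sketch below. Everything in it is hypothetical (the
<computeroutput>StackRecord</computeroutput> type, the fixed-size
table and both function names), and it takes "uncovered" to mean
that <computeroutput>%ESP</computeroutput> has risen above the
lowest protected address, which is one reading of the scheme rather
than a settled design.</para>
<programlisting><![CDATA[
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RECORDS 128   /* arbitrary capacity for the sketch */

typedef struct {
    uint32_t addr, len;   /* protected range [addr, addr+len) */
} StackRecord;

static StackRecord records[MAX_RECORDS];
static size_t n_records;

/* Effect of VALGRIND_MAKE_NOACCESS_STACK(addr, len), given the
   client's %ESP at the time the macro executes. */
static void mark_noaccess_stack(uint32_t addr, uint32_t len, uint32_t esp)
{
    assert(n_records < MAX_RECORDS);
    assert(addr >= esp);   /* the range must lie in the live stack */
    records[n_records].addr = addr;
    records[n_records].len  = len;
    n_records++;
}

/* Called whenever the client's %ESP changes.  The stack grows
   downwards, so a range is uncovered once %ESP has risen above its
   lowest address.  Returns how many records were discarded. */
static size_t on_esp_change(uint32_t new_esp)
{
    size_t kept = 0, dropped = 0;
    for (size_t i = 0; i < n_records; i++) {
        if (records[i].addr < new_esp)
            dropped++;
        else
            records[kept++] = records[i];
    }
    n_records = kept;
    return dropped;
}
]]></programlisting>
<para>The linear scan on every <computeroutput>%ESP</computeroutput>
change is exactly the cost worried about above; keeping the records
sorted by address would let the scan stop at the first survivor.</para>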
2738
<para>Although it is some work, it's all eminently doable, and it
would make Valgrind into an even-more-useful tool.</para>
2741
2742</sect3>
2743
2744</sect2>
2745
2746</sect1>
2747</chapter>