Blame - memcheck/docs/mc-tech-docs.xml - fp2-dev/platform/external/valgrind

blob: 492902c158e2e1ebd26f3ca152ad2435fef90c57

njn	3e986b2	2004-11-30 10:43:45 +0000	1	<?xml version="1.0"?> <!-- -*- sgml -*- -->
				2	<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
				3	"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
				4
				5	<chapter id="mc-tech-docs"
				6	xreflabel="The design and implementation of Valgrind">
				7
				8	<title>The Design and Implementation of Valgrind</title>
				9	<subtitle>Detailed technical notes for hackers, maintainers and
				10	the overly-curious</subtitle>
				11
				12	<sect1 id="mc-tech-docs.intro" xreflabel="Introduction">
				13	<title>Introduction</title>
				14
				15	<para>This document contains a detailed, highly-technical
				16	description of the internals of Valgrind. This is not the user
				17	manual; if you are an end-user of Valgrind, you do not want to
				18	read this. Conversely, if you really are a hacker-type and want
				19	to know how it works, I assume that you have read the user manual
				20	thoroughly.</para>
				21
				22	<para>You may need to read this document several times, and
				23	carefully. Some important things, I only say once.</para>
				24
				25
				26
				27
				28	<sect2 id="mc-tech-docs.history" xreflabel="History">
				29	<title>History</title>
				30
				31	<para>Valgrind came into public view in late Feb 2002. However,
				32	it has been under contemplation for a very long time, perhaps
				33	seriously for about five years. Somewhat over two years ago, I
				34	started working on the x86 code generator for the Glasgow Haskell
				35	Compiler (http://www.haskell.org/ghc), gaining familiarity with
				36	x86 internals on the way. I then did Cacheprof
				37	(http://www.cacheprof.org), gaining further x86 experience. Some
				38	time around Feb 2000 I started experimenting with a user-space
				39	x86 interpreter for x86-Linux. This worked, but it was clear
				40	that a JIT-based scheme would be necessary to give reasonable
				41	performance for Valgrind. Design work for the JITter started in
				42	earnest in Oct 2000, and by early 2001 I had an x86-to-x86
				43	dynamic translator which could run quite large programs. This
				44	translator was in a sense pointless, since it did not do any
				45	instrumentation or checking.</para>
				46
				47	<para>Most of the rest of 2001 was taken up designing and
				48	implementing the instrumentation scheme. The main difficulty,
				49	which consumed a lot of effort, was to design a scheme which did
				50	not generate large numbers of false uninitialised-value warnings.
				51	By late 2001 a satisfactory scheme had been arrived at, and I
				52	started to test it on ever-larger programs, with an eventual eye
				53	to making it work well enough so that it was helpful to folks
				54	debugging the upcoming version 3 of KDE. I've used KDE since
				55	before version 1.0, and wanted Valgrind to be an indirect
				56	contribution to the KDE 3 development effort. At the start of
				57	Feb 02 the kde-core-devel crew started using it, and gave a huge
				58	amount of helpful feedback and patches in the space of three
				59	weeks. Snapshot 20020306 is the result.</para>
				60
				61	<para>In the best Unix tradition, or perhaps in the spirit of
				62	Fred Brooks' depressing-but-completely-accurate epitaph "build
				63	one to throw away; you will anyway", much of Valgrind is a second
				64	or third rendition of the initial idea. The instrumentation
				65	machinery (<filename>vg_translate.c</filename>,
				66	<filename>vg_memory.c</filename>) and core CPU simulation
				67	(<filename>vg_to_ucode.c</filename>,
				68	<filename>vg_from_ucode.c</filename>) have had three redesigns
				69	and rewrites; the register allocator, low-level memory manager
				70	(<filename>vg_malloc2.c</filename>) and symbol table reader
				71	(<filename>vg_symtab2.c</filename>) are on the second rewrite.
				72	In a sense, this document serves to record some of the knowledge
				73	gained as a result.</para>
				74
				75	</sect2>
				76
				77
				78	<sect2 id="mc-tech-docs.overview" xreflabel="Design overview">
				79	<title>Design overview</title>
				80
				81	<para>Valgrind is compiled into a Linux shared object,
				82	<filename>valgrind.so</filename>, and also a dummy one,
				83	<filename>valgrinq.so</filename>, of which more later. The
				84	<filename>valgrind</filename> shell script adds
				85	<filename>valgrind.so</filename> to the
				86	<computeroutput>LD_PRELOAD</computeroutput> list of extra
				87	libraries to be loaded with any dynamically linked executable. This
				88	is a standard trick, one which I assume the
				89	<computeroutput>LD_PRELOAD</computeroutput> mechanism was
				90	developed to support.</para>
				91
				92	<para><filename>valgrind.so</filename> is linked with the
				93	<computeroutput>-z initfirst</computeroutput> flag, which
				94	requests that its initialisation code is run before that of any
				95	other object in the executable image. When this happens,
				96	valgrind gains control. The real CPU becomes "trapped" in
				97	<filename>valgrind.so</filename> and the translations it
				98	generates. The synthetic CPU provided by Valgrind does, however,
				99	return from this initialisation function. So the normal startup
				100	actions, orchestrated by the dynamic linker
				101	<filename>ld.so</filename>, continue as usual, except on the
				102	synthetic CPU, not the real one. Eventually
				103	<computeroutput>main</computeroutput> is run and returns, and
				104	then the finalisation code of the shared objects is run,
				105	presumably in the reverse of the order in which they were initialised.
				106	Remember, this is still all happening on the simulated CPU.
				107	Eventually <filename>valgrind.so</filename>'s own finalisation
				108	code is called. It spots this event, shuts down the simulated
				109	CPU, prints any error summaries and/or does leak detection, and
				110	returns from the initialisation code on the real CPU. At this
				111	point, in effect the real and synthetic CPUs have merged back
				112	into one, Valgrind has lost control of the program, and the
				113	program finally <computeroutput>exit()s</computeroutput> back to
				114	the kernel in the usual way.</para>
				115
				116	<para>The normal course of activity, once Valgrind has started
				117	up, is as follows. Valgrind never runs any part of your program
				118	(usually referred to as the "client"), not a single byte of it,
				119	directly. Instead it uses function
				120	<computeroutput>VG_(translate)</computeroutput> to translate
				121	basic blocks (BBs, straight-line sequences of code) into
				122	instrumented translations, and those are run instead. The
				123	translations are stored in the translation cache (TC),
				124	<computeroutput>vg_tc</computeroutput>, with the translation
				125	table (TT), <computeroutput>vg_tt</computeroutput> supplying the
				126	original-to-translation code address mapping. Auxiliary array
				127	<computeroutput>VG_(tt_fast)</computeroutput> is used as a
				128	direct-map cache for fast lookups in TT; it usually achieves a
				129	hit rate of around 98% and facilitates an orig-to-trans lookup in
				130	4 x86 insns, which is not bad.</para>
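
<para>The fast-path lookup can be pictured with the minimal sketch
below. It is not Valgrind's code: the table size, the hash and the
names <computeroutput>FastEntry</computeroutput>,
<computeroutput>fast_insert</computeroutput> and
<computeroutput>fast_lookup</computeroutput> are assumptions made
for illustration. Only the idea of a direct-mapped cache sitting in
front of a slower table comes from the text above.</para>

<programlisting><![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size; the real size of VG_(tt_fast) is not stated here. */
#define FAST_SIZE 1024u

typedef struct {
    uint32_t orig_addr;   /* original (client) code address            */
    void    *trans_addr;  /* address of its translation, NULL if empty */
} FastEntry;

static FastEntry fast_cache[FAST_SIZE];

/* Direct-mapped: every address can live in exactly one slot. */
static unsigned fast_index(uint32_t orig_addr)
{
    return orig_addr % FAST_SIZE;
}

/* Record a translation, evicting whatever occupied the slot before. */
void fast_insert(uint32_t orig_addr, void *trans_addr)
{
    FastEntry *e = &fast_cache[fast_index(orig_addr)];
    e->orig_addr  = orig_addr;
    e->trans_addr = trans_addr;
}

/* Return the translation, or NULL on a miss; on a miss the caller
   would fall back to the full translation table. */
void *fast_lookup(uint32_t orig_addr)
{
    const FastEntry *e = &fast_cache[fast_index(orig_addr)];
    if (e->trans_addr != NULL && e->orig_addr == orig_addr)
        return e->trans_addr;
    return NULL;
}
```
]]></programlisting>

<para>A hit costs one index computation and one compare, which is
why such a cache is cheap enough to sit on the dispatcher's hot
path. Two addresses that map to the same slot simply evict each
other.</para>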
				131
				132	<para>Function <computeroutput>VG_(dispatch)</computeroutput> in
				133	<filename>vg_dispatch.S</filename> is the heart of the JIT
				134	dispatcher. Once a translated code address has been found, it is
				135	executed simply by an x86 <computeroutput>call</computeroutput>
				136	to the translation. At the end of the translation, the next
				137	original code addr is loaded into
				138	<computeroutput>%eax</computeroutput>, and the translation then
				139	does a <computeroutput>ret</computeroutput>, taking it back to
				140	the dispatch loop, with, interestingly, zero branch
				141	mispredictions. The address requested in
				142	<computeroutput>%eax</computeroutput> is looked up first in
				143	<computeroutput>VG_(tt_fast)</computeroutput>, and, if not found,
				144	by calling C helper
				145	<computeroutput>VG_(search_transtab)</computeroutput>. If there
				146	is still no translation available,
				147	<computeroutput>VG_(dispatch)</computeroutput> exits back to the
				148	top-level C dispatcher
				149	<computeroutput>VG_(toploop)</computeroutput>, which arranges for
				150	<computeroutput>VG_(translate)</computeroutput> to make a new
				151	translation. All fairly unsurprising, really. There are various
				152	complexities described below.</para>
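
<para>The control flow just described can be sketched in C as
follows. This is an illustration only, under loud assumptions:
translations are modelled as C functions that return the next
original address, the two-level lookup is collapsed into a single
<computeroutput>lookup</computeroutput>, and every name here
(<computeroutput>run_dispatch</computeroutput>,
<computeroutput>EXIT_ADDR</computeroutput> and so on) is
hypothetical. The real dispatcher is hand-written assembly.</para>

<programlisting><![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Addr;
/* A "translation" runs one basic block and returns the next
   original code address, as the real ones do via %eax. */
typedef Addr (*Translation)(void);

#define EXIT_ADDR 0u   /* hypothetical sentinel: client has finished */
#define N_BLOCKS  4u

/* Toy client: block 1 -> block 2 -> block 3 -> exit. */
static Addr block1(void) { return 2; }
static Addr block2(void) { return 3; }
static Addr block3(void) { return EXIT_ADDR; }

/* Stand-in for TT/TC: filled lazily, one slot per toy address. */
static Translation table[N_BLOCKS];
static unsigned translations_made;

static Translation lookup(Addr a)
{
    return a < N_BLOCKS ? table[a] : NULL;
}

/* Stand-in for the translator: "compiles" a block on first use. */
static Translation translate(Addr a)
{
    static const Translation code[N_BLOCKS] = { NULL, block1, block2, block3 };
    if (a >= N_BLOCKS || code[a] == NULL)
        return NULL;
    table[a] = code[a];
    translations_made++;
    return table[a];
}

/* Dispatch loop: find a translation, run it, repeat with the
   address it hands back. Returns the number of blocks executed. */
unsigned run_dispatch(Addr start)
{
    unsigned blocks_run = 0;
    Addr next = start;
    while (next != EXIT_ADDR) {
        Translation t = lookup(next);
        if (t == NULL)
            t = translate(next);   /* miss: make a new translation */
        if (t == NULL)
            break;                 /* nothing to run at this address */
        next = t();
        blocks_run++;
    }
    return blocks_run;
}
```
]]></programlisting>

<para>Running the toy client twice shows the point of caching: the
second run executes blocks without making any new
translations.</para>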
				153
				154	<para>The translator, orchestrated by
				155	<computeroutput>VG_(translate)</computeroutput>, is complicated
				156	but entirely self-contained. It is described in great detail in
				157	subsequent sections. Translations are stored in TC, with TT
				158	tracking administrative information. The translations are
				159	subject to an approximate LRU-based management scheme. With the
				160	current settings, the TC can hold at most about 15MB of
				161	translations, and LRU passes prune it to about 13.5MB. Given
				162	that the orig-to-translation expansion ratio is about 13:1 to
				163	14:1, this means TC holds translations for more or less a
				164	megabyte of original code, which generally comes to about 70000
				165	basic blocks for C++ compiled with optimisation on. Generating
				166	new translations is expensive, so it is worth having a large TC
				167	to minimise the (capacity) miss rate.</para>
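
<para>The arithmetic behind those figures is easy to check. The
helper names below are invented for illustration, and the inputs
are only the approximate numbers quoted above: 15MB at an expansion
ratio of 13:1 to 14:1 corresponds to roughly 1.07MB to 1.15MB of
original code, and a megabyte spread over 70000 basic blocks is
about 15 bytes of original code per block.</para>

<programlisting><![CDATA[
```c
/* Megabytes of original code whose translations fit in the TC. */
double original_code_mb(double tc_mb, double expansion_ratio)
{
    return tc_mb / expansion_ratio;
}

/* Average size in bytes of one original basic block. */
double bytes_per_block(double original_mb, double n_blocks)
{
    return original_mb * 1024.0 * 1024.0 / n_blocks;
}
```
]]></programlisting>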
				168
				169	<para>The dispatcher,
				170	<computeroutput>VG_(dispatch)</computeroutput>, receives hints
				171	from the translations which allow it to cheaply spot all control
				172	transfers corresponding to x86
				173	<computeroutput>call</computeroutput> and
				174	<computeroutput>ret</computeroutput> instructions. It has to do
				175	this in order to spot some special events:</para>
				176
				177	<itemizedlist>
				178	<listitem>
				179	<para>Calls to
				180	<computeroutput>VG_(shutdown)</computeroutput>. This is
				181	Valgrind's cue to exit. NOTE: actually this is done a
				182	different way; it should be cleaned up.</para>
				183	</listitem>
				184
				185	<listitem>
				186	<para>Returns of signal handlers, to the return address
				187	<computeroutput>VG_(signalreturn_bogusRA)</computeroutput>.
				188	The signal simulator needs to know when a signal handler is
				189	returning, so we spot jumps (returns) to this address.</para>
				190	</listitem>
				191
				192	<listitem>
				193	<para>Calls to <computeroutput>vg_trap_here</computeroutput>.
				194	All <computeroutput>malloc</computeroutput>,
				195	<computeroutput>free</computeroutput>, etc calls that the
				196	client program makes are eventually routed to a call to
				197	<computeroutput>vg_trap_here</computeroutput>, and Valgrind
				198	does its own special thing with these calls. In effect this
				199	provides a trapdoor, by which Valgrind can intercept certain
				200	calls on the simulated CPU, run the call as it sees fit
				201	itself (on the real CPU), and return the result to the
				202	simulated CPU, quite transparently to the client
				203	program.</para>
				204	</listitem>
				205
				206	</itemizedlist>
				207
				208	<para>Valgrind intercepts the client's
				209	<computeroutput>malloc</computeroutput>,
				210	<computeroutput>free</computeroutput>, etc, calls, so that it can
				211	store additional information. Each block
				212	<computeroutput>malloc</computeroutput>'d by the client gives
				213	rise to a shadow block in which Valgrind stores the call stack at
				214	the time of the <computeroutput>malloc</computeroutput> call.
				215	When the client calls <computeroutput>free</computeroutput>,
				216	Valgrind tries to find the shadow block corresponding to the
				217	address passed to <computeroutput>free</computeroutput>, and
				218	emits an error message if none can be found. If it is found, the
				219	block is placed on the freed blocks queue
				220	<computeroutput>vg_freed_list</computeroutput>, it is marked as
				221	inaccessible, and its shadow block now records the call stack at
				222	the time of the <computeroutput>free</computeroutput> call.
				223	Keeping <computeroutput>free</computeroutput>'d blocks in this
				224	queue allows Valgrind to spot all (presumably invalid) accesses
				225	to them. However, once the volume of blocks in the free queue
				226	exceeds <computeroutput>VG_(clo_freelist_vol)</computeroutput>,
				227	blocks are finally removed from the queue.</para>
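
<para>A volume-capped queue of freed blocks might look like the
sketch below. It is an assumption-laden illustration, not the code
in <filename>vg_clientmalloc.c</filename>: the
<computeroutput>Shadow</computeroutput> and
<computeroutput>FreedQueue</computeroutput> types and both function
names are hypothetical, the call stacks are omitted, and
<computeroutput>max_volume</computeroutput> merely plays the role
of <computeroutput>VG_(clo_freelist_vol)</computeroutput>.</para>

<programlisting><![CDATA[
```c
#include <stddef.h>
#include <stdlib.h>

/* Shadow record for one client block (call stacks omitted). */
typedef struct Shadow {
    size_t         size;   /* size of the client block in bytes */
    struct Shadow *next;   /* next-younger entry in the queue   */
} Shadow;

typedef struct {
    Shadow *head;          /* oldest freed block                 */
    Shadow *tail;          /* youngest freed block               */
    size_t  volume;        /* total bytes currently held back    */
    size_t  max_volume;    /* cap on the volume of freed blocks  */
} FreedQueue;

Shadow *shadow_new(size_t size)
{
    Shadow *sh = malloc(sizeof *sh);
    if (sh != NULL) {
        sh->size = size;
        sh->next = NULL;
    }
    return sh;
}

/* Queue a freshly freed block, then really release the oldest
   blocks until the queue is back under its cap. Returns the number
   of blocks released. */
unsigned freed_queue_add(FreedQueue *q, Shadow *sh)
{
    unsigned released = 0;

    sh->next = NULL;
    if (q->tail != NULL)
        q->tail->next = sh;
    else
        q->head = sh;
    q->tail = sh;
    q->volume += sh->size;

    while (q->volume > q->max_volume && q->head != NULL) {
        Shadow *old = q->head;
        q->head = old->next;
        if (q->head == NULL)
            q->tail = NULL;
        q->volume -= old->size;
        free(old);
        released++;
    }
    return released;
}
```
]]></programlisting>

<para>The larger the cap, the longer a freed block stays
inaccessible and the better the chance of catching a late access to
it, at the cost of memory.</para>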
				228
				229	<para>Keeping track of <literal>A</literal> and
				230	<literal>V</literal> bits (note: if you don't know what these
				231	are, you haven't read the user guide carefully enough) for memory
				232	is done in <filename>vg_memory.c</filename>. This implements a
				233	sparse array structure which covers the entire 4G address space
				234	in a way which is reasonably fast and reasonably space efficient.
				235	The 4G address space is divided up into 64K sections, each
				236	covering 64KB of address space. Given a 32-bit address, the top
				237	16 bits are used to select one of the 65536 entries in
				238	<computeroutput>VG_(primary_map)</computeroutput>. The resulting
				239	"secondary" (<computeroutput>SecMap</computeroutput>) holds A and
				240	V bits for that 64KB chunk of address space; the lower 16 bits
				241	of the address index into it.</para>
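
<para>The two-level structure can be sketched as below. Only the
16/16 split of the address comes from the text; the field layout,
the lazy allocation, the bit encoding (1 meaning accessible) and
the function names are assumptions for illustration. V bits are
given a field but no accessors, to keep the sketch short.</para>

<programlisting><![CDATA[
```c
#include <stdint.h>
#include <stdlib.h>

#define SEC_BYTES 65536u   /* address space covered by one secondary */

typedef struct {
    uint8_t abits[SEC_BYTES / 8];  /* one A bit per client byte        */
    uint8_t vbyte[SEC_BYTES];      /* eight V bits per client byte     */
} SecMap;

/* One entry per 64KB chunk of the 4G address space; NULL until the
   chunk is first touched (an assumption made for this sketch). */
static SecMap *primary_map[65536];

static SecMap *get_secmap(uint32_t addr)
{
    uint32_t hi = addr >> 16;                 /* top 16 bits select */
    if (primary_map[hi] == NULL)
        primary_map[hi] = calloc(1, sizeof(SecMap));
    return primary_map[hi];
}

void set_abit(uint32_t addr, int accessible)
{
    SecMap  *sm = get_secmap(addr);
    uint32_t lo = addr & 0xFFFFu;             /* low 16 bits index  */
    if (sm == NULL)
        return;                               /* out of memory      */
    if (accessible)
        sm->abits[lo >> 3] |= (uint8_t)(1u << (lo & 7u));
    else
        sm->abits[lo >> 3] &= (uint8_t)~(1u << (lo & 7u));
}

int get_abit(uint32_t addr)
{
    const SecMap *sm = primary_map[addr >> 16];
    uint32_t      lo = addr & 0xFFFFu;
    if (sm == NULL)
        return 0;                             /* untouched chunk    */
    return (sm->abits[lo >> 3] >> (lo & 7u)) & 1;
}
```
]]></programlisting>

<para>A lookup is two array indexings and a shift, and chunks of
address space that are never touched cost only one pointer in the
primary map.</para>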
				242
				243	</sect2>
				244
				245
				246
				247	<sect2 id="mc-tech-docs.design" xreflabel="Design decisions">
				248	<title>Design decisions</title>
				249
				250	<para>Some design decisions were motivated by the need to make
				251	Valgrind debuggable. Imagine you are writing a CPU simulator.
				252	It works fairly well. However, you run some large program, like
				253	Netscape, and after tens of millions of instructions, it crashes.
				254	How can you figure out where in your simulator the bug is?</para>
				255
				256	<para>Valgrind's answer is: cheat. Valgrind is designed so that
				257	it is possible to switch back to running the client program on
				258	the real CPU at any point. Using the
				259	<computeroutput>--stop-after= </computeroutput> flag, you can ask
				260	Valgrind to run just some number of basic blocks, and then run
				261	the rest of the way on the real CPU. If you are searching for a
				262	bug in the simulated CPU, you can use this to do a binary search,
				263	which quickly leads you to the specific basic block which is
				264	causing the problem.</para>
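
<para>The binary search itself is ordinary bisection over the basic
block count. In the sketch below the predicate is a stand-in: in
practice each probe would be a fresh run of the client with a
different stop-after count, and deciding whether that run
misbehaved is up to you. The names
<computeroutput>find_bad_block</computeroutput> and
<computeroutput>demo_fails</computeroutput> are hypothetical.</para>

<programlisting><![CDATA[
```c
/* Returns nonzero if the client misbehaves when its first n_blocks
   basic blocks are run on the simulated CPU. */
typedef int (*FailsAfter)(unsigned long n_blocks);

/* Stand-in for re-running the client: pretends that basic block
   number 123457 is the one that is simulated wrongly. */
static int demo_fails(unsigned long n_blocks)
{
    return n_blocks >= 123457UL;
}

/* Precondition: fails(lo) is zero, fails(hi) is nonzero, and once a
   count fails every larger count fails too. Returns the smallest
   failing count, i.e. the number of the offending basic block. */
unsigned long find_bad_block(FailsAfter fails,
                             unsigned long lo, unsigned long hi)
{
    while (hi - lo > 1) {
        unsigned long mid = lo + (hi - lo) / 2;
        if (fails(mid))
            hi = mid;   /* bug is at or before block mid */
        else
            lo = mid;   /* first mid blocks are fine     */
    }
    return hi;
}
```
]]></programlisting>

<para>Ten million basic blocks need only about 24 probes, which is
what makes this practical even for large programs.</para>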
				265
				266	<para>This is all very handy. It does constrain the design in
				267	certain unimportant ways. Firstly, the layout of memory, when
				268	viewed from the client's point of view, must be identical
				269	regardless of whether it is running on the real or simulated CPU.
				270	This means that Valgrind can't do pointer swizzling -- well, no
				271	great loss -- and it can't run on the same stack as the client --
				272	again, no great loss. Valgrind operates on its own stack,
				273	<computeroutput>VG_(stack)</computeroutput>, which it switches to
				274	at startup, temporarily switching back to the client's stack when
				275	doing system calls for the client.</para>
				276
				277	<para>Valgrind also receives signals on its own stack,
				278	<computeroutput>VG_(sigstack)</computeroutput>, but for different
				279	gruesome reasons discussed below.</para>
				280
				281	<para>This nice clean
				282	switch-back-to-the-real-CPU-whenever-you-like story is muddied by
				283	signals. Problem is that signals arrive at arbitrary times and
				284	tend to slightly perturb the basic block count, with the result
				285	that you can get close to the basic block causing a problem but
				286	can't home in on it exactly. My kludgey hack is to define
				287	<computeroutput>SIGNAL_SIMULATION</computeroutput> to 1 towards
				288	the bottom of <filename>vg_syscall_mem.c</filename>, so that
				289	signal handlers are run on the real CPU and don't change the BB
				290	counts.</para>
				291
				292	<para>A second hole in the switch-back-to-real-CPU story is that
				293	Valgrind's way of delivering signals to the client is different
				294	from that of the kernel. Specifically, the layout of the signal
				295	delivery frame, and the mechanism used to detect a sighandler
				296	returning, are different. So you can't expect to make the
				297	transition inside a sighandler and still have things working, but
				298	in practice that's not much of a restriction.</para>
				299
				300	<para>Valgrind's implementation of
				301	<computeroutput>malloc</computeroutput>,
				302	<computeroutput>free</computeroutput>, etc, (in
				303	<filename>vg_clientmalloc.c</filename>, not the low-level stuff
				304	in <filename>vg_malloc2.c</filename>) is somewhat complicated by
				305	the need to handle switching back at arbitrary points. It does
				306	work, though.</para>
				307
				308	</sect2>
				309
				310
				311
				312	<sect2 id="mc-tech-docs.correctness" xreflabel="Correctness">
				313	<title>Correctness</title>
				314
				315	<para>There's only one of me, and I have a Real Life (tm) as well
				316	as hacking Valgrind [allegedly :-]. That means I don't have time
				317	to waste chasing endless bugs in Valgrind. My emphasis is
				318	therefore on doing everything as simply as possible, with
				319	correctness, stability and robustness being the number one
				320	priority, more important than performance or functionality. As a
				321	result:</para>
				322
				323	<itemizedlist>
				324
				325	<listitem>
				326	<para>The code is absolutely loaded with assertions, and
				327	these are <command>permanently enabled.</command> I have no
				328	plan to remove or disable them later. Over the past couple
				329	of months, as valgrind has become more widely used, they have
				330	shown their worth, pulling up various bugs which would
				331	otherwise have appeared as hard-to-find segmentation
				332	faults.</para>
				333
				334	<para>I am of the view that it's acceptable to spend 5% of
				335	the total running time of your valgrindified program doing
				336	assertion checks and other internal sanity checks.</para>
				337	</listitem>
				338
				339	<listitem>
				340	<para>Aside from the assertions, valgrind contains various
				341	sets of internal sanity checks, which get run at varying
				342	frequencies during normal operation.
				343	<computeroutput>VG_(do_sanity_checks)</computeroutput> runs
				344	every 1000 basic blocks, which means 500 to 2000 times/second
				345	for typical machines at present. It checks that Valgrind
				346	hasn't overrun its private stack, and does some simple checks
				347	on the memory permissions maps. Once every 25 calls it does
				348	some more extensive checks on those maps. Etc, etc.</para>
				349	<para>The following components also have sanity check code,
				350	which can be enabled to aid debugging:</para>
				351	<itemizedlist>
				352	<listitem><para>The low-level memory-manager
				353	(<computeroutput>VG_(mallocSanityCheckArena)</computeroutput>).
				354	This does a complete check of all blocks and chains in an
				355	arena, which is very slow. Is not engaged by default.</para>
				356	</listitem>
				357
				358	<listitem>
				359	<para>The symbol table reader(s): various checks to
				360	ensure uniqueness of mappings; see
				361	<computeroutput>VG_(read_symbols)</computeroutput> for a
				362	start. Is permanently engaged.</para>
				363	</listitem>
				364
				365	<listitem>
				366	<para>The A and V bit tracking stuff in
				367	<filename>vg_memory.c</filename>. This can be compiled
				368	with cpp symbol
				369	<computeroutput>VG_DEBUG_MEMORY</computeroutput> defined,
				370	which removes all the fast, optimised cases, and uses
				371	simple-but-slow fallbacks instead. Not engaged by
				372	default.</para>
				373	</listitem>
				374
				375	<listitem>
				376	<para>Ditto
				377	<computeroutput>VG_DEBUG_LEAKCHECK</computeroutput>.</para>
				378	</listitem>
				379
				380	<listitem>
				381	<para>The JITter parses x86 basic blocks into sequences
				382	of UCode instructions. It then sanity checks each one
				383	with <computeroutput>VG_(saneUInstr)</computeroutput> and
				384	sanity checks the sequence as a whole with
				385	<computeroutput>VG_(saneUCodeBlock)</computeroutput>.
				386	This stuff is engaged by default, and has caught some
				387	way-obscure bugs in the simulated CPU machinery in its
				388	time.</para>
				389	</listitem>
				390
				391	<listitem>
				392	<para>The system call wrapper does
				393	<computeroutput>VG_(first_and_last_secondaries_look_plausible)</computeroutput>
				394	after every syscall; this is known to pick up bugs in the
				395	syscall wrappers. Engaged by default.</para>
				396	</listitem>
				397
				398	<listitem>
				399	<para>The main dispatch loop, in
				400	<computeroutput>VG_(dispatch)</computeroutput>, checks
				401	that translations do not set
				402	<computeroutput>%ebp</computeroutput> to any value
				403	different from
				404	<computeroutput>VG_EBP_DISPATCH_CHECKED</computeroutput>
				405	or <computeroutput>&amp; VG_(baseBlock)</computeroutput>.
				406	In effect this test is free, and is permanently
				407	engaged.</para>
				408	</listitem>
				409
				410	<listitem>
				411	<para>There are a couple of ifdefed-out consistency
				412	checks I inserted whilst debugging the new register
				413	allocator,
				414	<computeroutput>vg_do_register_allocation</computeroutput>.</para>
				415	</listitem>
				416	</itemizedlist>
				417	</listitem>
				418
				419	<listitem>
				420	<para>I try to avoid techniques, algorithms, mechanisms, etc,
				421	for which I can supply neither a convincing argument that
				422	they are correct, nor sanity-check code which might pick up
				423	bugs in my implementation. I don't always succeed in this,
				424	but I try. Basically the idea is: avoid techniques which
				425	are, in practice, unverifiable, in some sense. When doing
				426	anything, always have in mind: "how can I verify that this is
				427	correct?"</para>
				428	</listitem>
				429
				430	</itemizedlist>
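
<para>The two-tier schedule described above for
<computeroutput>VG_(do_sanity_checks)</computeroutput> (cheap
checks every 1000 basic blocks, more extensive ones on every 25th
call) amounts to a pair of counters. The sketch below only counts
how often each tier would fire; the struct and function names are
hypothetical and the checks themselves are left out.</para>

<programlisting><![CDATA[
```c
#define BBS_PER_CHECK        1000UL  /* basic blocks between cheap checks */
#define CALLS_PER_EXPENSIVE  25UL    /* cheap checks per extensive check  */

typedef struct {
    unsigned long bbs_done;        /* basic blocks executed so far  */
    unsigned long cheap_runs;      /* times the cheap checks ran    */
    unsigned long expensive_runs;  /* times the extensive checks ran */
} SanityState;

/* Call once after every basic block. */
void after_basic_block(SanityState *s)
{
    s->bbs_done++;
    if (s->bbs_done % BBS_PER_CHECK != 0)
        return;

    /* Cheap tier: stack overrun and simple map checks would go here. */
    s->cheap_runs++;

    /* Extensive tier: the slower map checks would go here. */
    if (s->cheap_runs % CALLS_PER_EXPENSIVE == 0)
        s->expensive_runs++;
}
```
]]></programlisting>

<para>With these settings the extensive checks run once per 25000
basic blocks, so their cost is amortised over a lot of client
code.</para>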
				431
				432
				433	<para>Some more specific things are:</para>
				434	<itemizedlist>
				435	<listitem>
				436	<para>Valgrind runs in the same namespace as the client, at
				437	least from <filename>ld.so</filename>'s point of view, and it
				438	therefore absolutely had better not export any symbol with a
				439	name which could clash with that of the client or any of its
				440	libraries. Therefore, all globally visible symbols exported
				441	from <filename>valgrind.so</filename> are defined using the
				442	<computeroutput>VG_</computeroutput> CPP macro. As you'll
				443	see from <filename>vg_constants.h</filename>, this appends
				444	some arbitrary prefix to the symbol, in order that it be, we
				445	hope, globally unique. Currently the prefix is
				446	<computeroutput>vgPlain_</computeroutput>. For convenience
				447	there are also <computeroutput>VGM_</computeroutput>,
				448	<computeroutput>VGP_</computeroutput> and
				449	<computeroutput>VGOFF_</computeroutput>. All locally defined
				450	symbols are declared <computeroutput>static</computeroutput>
				451	and do not appear in the final shared object.</para>
				452
				453	<para>To check this, I periodically do <computeroutput>nm
				454	valgrind.so | grep " T "</computeroutput>, which shows you
				455	all the globally exported text symbols. They should all have
				456	an approved prefix, except for those like
				457	<computeroutput>malloc</computeroutput>,
				458	<computeroutput>free</computeroutput>, etc, which we
				459	deliberately want to shadow and take precedence over the same
				460	names exported from <filename>glibc.so</filename>, so that
				461	valgrind can intercept those calls easily. Similarly,
				462	<computeroutput>nm valgrind.so | grep " D "</computeroutput>
				463	allows you to find any rogue data-segment symbol
				464	names.</para>
				465	</listitem>
				466
				467	<listitem>
				468	<para>Valgrind tries, and almost succeeds, in being
				469	completely independent of all other shared objects, in
				470	particular of <filename>glibc.so</filename>. For example, we
				471	have our own low-level memory manager in
				472	<filename>vg_malloc2.c</filename>, which is a fairly standard
				473	malloc/free scheme augmented with arenas, and
				474	<filename>vg_mylibc.c</filename> exports reimplementations of
				475	various bits and pieces you'd normally get from the C
				476	library.</para>
				477
				478	<para>Why all the hassle? Because imagine the potential
				479	chaos of both the simulated and real CPUs executing in
				480	<filename>glibc.so</filename>. It just seems simpler and
				481	cleaner to be completely self-contained, so that only the
				482	simulated CPU visits <filename>glibc.so</filename>. In
				483	practice it's not much hassle anyway. Also, valgrind starts
				484	up before glibc has a chance to initialise itself, and who
				485	knows what difficulties that could lead to. Finally, glibc
				486	has definitions for some types, specifically
				487	<computeroutput>sigset_t</computeroutput>, which conflict with
				488	(are different from) the Linux kernel's idea of the same. When
				489	Valgrind wants to fiddle around with signal stuff, it wants
				490	to use the kernel's definitions, not glibc's definitions. So
				491	it's simplest just to keep glibc out of the picture
				492	entirely.</para>
				493
				494	<para>To find out which glibc symbols are used by Valgrind,
				495	reinstate the link flags <computeroutput>-nostdlib
				496	-Wl,-no-undefined</computeroutput>. This causes linking to
				497	fail, but will tell you what you depend on. I have mostly,
				498	but not entirely, got rid of the glibc dependencies; what
				499	remains is, IMO, fairly harmless. AFAIK the current
				500	dependencies are: <computeroutput>memset</computeroutput>,
				501	<computeroutput>memcmp</computeroutput>,
				502	<computeroutput>stat</computeroutput>,
				503	<computeroutput>system</computeroutput>,
				504	<computeroutput>sbrk</computeroutput>,
				505	<computeroutput>setjmp</computeroutput> and
				506	<computeroutput>longjmp</computeroutput>.</para>
				507	</listitem>
				508
				509	<listitem>
				510	<para>Similarly, valgrind should not really import any
				511	headers other than the Linux kernel headers, since it knows
				512	of no API other than the kernel interface to talk to. At the
				513	moment this is really not in a good state, and
				514	<computeroutput>vg_syscall_mem</computeroutput> imports, via
				515	<filename>vg_unsafe.h</filename>, a significant number of
				516	C-library headers so as to know the sizes of various structs
				517	passed across the kernel boundary. This is of course
				518	completely bogus, since there is no guarantee that the C
				519	library's definitions of these structs match those of the
				520	kernel. I have started to sort this out using
				521	<filename>vg_kerneliface.h</filename>, into which I had
				522	intended to copy all kernel definitions which valgrind could
				523	need, but this has not gotten very far. At the moment it
				524	mostly contains definitions for
				525	<computeroutput>sigset_t</computeroutput> and
				526	<computeroutput>struct sigaction</computeroutput>, since the
				527	kernel's definition for these really does clash with glibc's.
				528	I plan to use a <computeroutput>vki_</computeroutput> prefix
				529	on all these types and constants, to denote the fact that
				530	they pertain to <command>V</command>algrind's
				531	<command>K</command>ernel
				532	<command>I</command>nterface.</para>
				533
				534	<para>Another advantage of having a
				535	<filename>vg_kerneliface.h</filename> file is that it makes
				536	it simpler to interface to a different kernel. One can, for
				537	example, easily imagine writing a new
				538	<filename>vg_kerneliface.h</filename> for FreeBSD, or x86
				539	NetBSD.</para>
				540	</listitem>
				541
				542	</itemizedlist>
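
<para>The symbol-prefixing convention described in the list above
is plain preprocessor token pasting. The sketch below shows the
general shape; the exact macro definitions are an assumption (see
<filename>vg_constants.h</filename> for the real ones), and the
<computeroutput>demo_</computeroutput> symbols are invented for
illustration.</para>

<programlisting><![CDATA[
```c
/* Assumed shape of the prefixing macros. */
#define VGAPPEND(a, b) a##b
#define VG_(name)      VGAPPEND(vgPlain_, name)

/* Expands to: int vgPlain_demo_counter = 0; */
int VG_(demo_counter) = 0;

/* Expands to: int vgPlain_demo_bump(void) */
int VG_(demo_bump)(void)
{
    return ++VG_(demo_counter);
}

/* File-local helper: static, so it is not exported and needs no
   prefix. */
static int local_helper(void)
{
    return 42;
}

/* Expands to: int vgPlain_demo_answer(void) */
int VG_(demo_answer)(void)
{
    return local_helper();
}
```
]]></programlisting>

<para>Changing the prefix in one place renames every exported
symbol, and anything that shows up in the
<computeroutput>nm</computeroutput> output without an approved
prefix stands out immediately.</para>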
				543
				544	</sect2>
				545
				546
				547
				548	<sect2 id="mc-tech-docs.limits" xreflabel="Current limitations">
				549	<title>Current limitations</title>
				550
				551	<para>Support for weird (non-POSIX) signal stuff is patchy. Does
				552	anybody care?</para>
				553
				554	</sect2>
				555
				556	</sect1>
				557
				558
				559
				560
				561
				562	<sect1 id="mc-tech-docs.jitter" xreflabel="The instrumenting JITter">
				563	<title>The instrumenting JITter</title>
				564
				565	<para>This really is the heart of the matter. We begin with
				566	various side issues.</para>
				567
				568
				569	<sect2 id="mc-tech-docs.storage"
				570	xreflabel="Run-time storage, and the use of host registers">
				571	<title>Run-time storage, and the use of host registers</title>
				572
				573	<para>Valgrind translates client (original) basic blocks into
				574	instrumented basic blocks, which live in the translation cache
				575	TC, until either the client finishes or the translations are
				576	ejected from TC to make room for newer ones.</para>
				577
				578	<para>Since it generates x86 code in memory, Valgrind has
				579	complete control of the use of registers in the translations.
				580	Now pay attention. I shall say this only once, and it is
				581	important you understand this. In what follows I will refer to
				582	registers in the host (real) cpu using their standard names,
				583	<computeroutput>%eax</computeroutput>,
				584	<computeroutput>%edi</computeroutput>, etc. I refer to registers
				585	in the simulated CPU by capitalising them:
				586	<computeroutput>%EAX</computeroutput>,
				587	<computeroutput>%EDI</computeroutput>, etc. These two sets of
				588	registers usually bear no direct relationship to each other;
				589	there is no fixed mapping between them. This naming scheme is
				590	used fairly consistently in the comments in the sources.</para>
				591
				592	<para>Host registers, once things are up and running, are used as
				593	follows:</para>
				594
				595	<itemizedlist>
				596	<listitem>
				597	<para><computeroutput>%esp</computeroutput>, the real stack
				598	pointer, points somewhere in Valgrind's private stack area,
				599	<computeroutput>VG_(stack)</computeroutput> or, transiently,
				600	into its signal delivery stack,
				601	<computeroutput>VG_(sigstack)</computeroutput>.</para>
				602	</listitem>
				603
				604	<listitem>
				605	<para><computeroutput>%edi</computeroutput> is used as a
				606	temporary in code generation; it is almost always dead,
				607	except when used for the
				608	<computeroutput>Left</computeroutput> value-tag operations.</para>
				609	</listitem>
				610
				611	<listitem>
				612	<para><computeroutput>%eax</computeroutput>,
				613	<computeroutput>%ebx</computeroutput>,
				614	<computeroutput>%ecx</computeroutput>,
				615	<computeroutput>%edx</computeroutput> and
				616	<computeroutput>%esi</computeroutput> are available to
				617	Valgrind's register allocator. They are dead (carry
				618	unimportant values) in between translations, and are live
				619	only in translations. The one exception to this is
				620	<computeroutput>%eax</computeroutput>, which, as mentioned
				621	far above, has a special significance to the dispatch loop
				622	<computeroutput>VG_(dispatch)</computeroutput>: when a
				623	translation returns to the dispatch loop,
				624	<computeroutput>%eax</computeroutput> is expected to contain
				625	the original-code-address of the next translation to run.
				626	The register allocator is so good at minimising spill code
				627	that using five regs and not having to save/restore
				628	<computeroutput>%edi</computeroutput> actually gives better
				629	code than allocating to <computeroutput>%edi</computeroutput>
				630	as well, but then having to push/pop it around special
				631	uses.</para>
				632	</listitem>
				633
				634	<listitem>
				635	<para><computeroutput>%ebp</computeroutput> points
				636	permanently at
				637	<computeroutput>VG_(baseBlock)</computeroutput>. Valgrind's
				638	translations are position-independent, partly because this is
				639	convenient, but also because translations get moved around in
				640	TC as part of the LRUing activity. <command>All</command>
				641	static entities which need to be referred to from generated
				642	code, whether data or helper functions, are stored starting
				643	at <computeroutput>VG_(baseBlock)</computeroutput> and are
				644	therefore reached by indexing from
				645	<computeroutput>%ebp</computeroutput>. There is but one
				646	exception, which is that by placing the value
				647	<computeroutput>VG_EBP_DISPATCH_CHECKED</computeroutput> in
				648	<computeroutput>%ebp</computeroutput> just before a return to
				649	the dispatcher, the dispatcher is informed that the next
				650	address to run, in <computeroutput>%eax</computeroutput>,
				651	requires special treatment.</para>
				652	</listitem>
				653
				654	<listitem>
				655	<para>The real machine's FPU state is pretty much
				656	unimportant, for reasons which will become obvious. Ditto
				657	its <computeroutput>%eflags</computeroutput> register.</para>
				658	</listitem>
				659
				660	</itemizedlist>
				661
				662	<para>The state of the simulated CPU is stored in memory, in
				663	<computeroutput>VG_(baseBlock)</computeroutput>, which is a block
				664	of 200 words IIRC. Recall that
				665	<computeroutput>%ebp</computeroutput> points permanently at the
				666	start of this block. Function
				667	<computeroutput>vg_init_baseBlock</computeroutput> decides what
				668	the offsets of various entities in
				669	<computeroutput>VG_(baseBlock)</computeroutput> are to be, and
				670	allocates word offsets for them. The code generator then emits
				671	<computeroutput>%ebp</computeroutput> relative addresses to get
				672	at those things. The sequence in which entities are allocated
				673	has been carefully chosen so that the 32 most popular entities
				674	come first, because this means 8-bit offsets can be used in the
				675	generated code.</para>
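<para>The 8-bit-offset argument is easy to check with a little
arithmetic. The sketch below is illustrative only and is not taken
from the Valgrind sources: the names
<computeroutput>byte_offset_of_word</computeroutput>,
<computeroutput>fits_in_disp8</computeroutput> and
<computeroutput>count_short_form_words</computeroutput> are invented
here. It assumes 4-byte words and relies on the x86 rule that a
one-byte displacement is a signed value in the range -128 to
127.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: byte offset from %ebp of a given word slot,
   assuming 4-byte words as on 32-bit x86. */
int32_t byte_offset_of_word(int32_t word_index)
{
    return word_index * 4;
}

/* An x86 disp8 is a signed byte, so it covers -128 .. 127. */
int fits_in_disp8(int32_t byte_offset)
{
    return byte_offset >= -128 && byte_offset <= 127;
}

/* Count how many word slots in [first_word, last_word] can be reached
   from %ebp with a one-byte displacement. */
int count_short_form_words(int32_t first_word, int32_t last_word)
{
    assert(first_word <= last_word);
    int n = 0;
    for (int32_t w = first_word; w <= last_word; w++)
        if (fits_in_disp8(byte_offset_of_word(w)))
            n++;
    return n;
}
```
]]></programlisting>
<para>With <computeroutput>%ebp</computeroutput> at the start of a
200-word block,
<computeroutput>count_short_form_words(0, 199)</computeroutput> gives
32. Were <computeroutput>%ebp</computeroutput> to point 32 words into
the block,
<computeroutput>count_short_form_words(-32, 167)</computeroutput>
would give 64.</para>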
				676
				677	<para>If I was clever, I could make
				678	<computeroutput>%ebp</computeroutput> point 32 words along
				679	<computeroutput>VG_(baseBlock)</computeroutput>, so that I'd have
				680	another 32 words of short-form offsets available, but that's just
				681	complicated, and it's not important -- the first 32 words take
				682	99% (or whatever) of the traffic.</para>
				683
				684	<para>Currently, the sequence of stuff in
				685	<computeroutput>VG_(baseBlock)</computeroutput> is as
				686	follows:</para>
				687
				688	<itemizedlist>
				689	<listitem>
				690	<para>9 words, holding the simulated integer registers,
				691	<computeroutput>%EAX</computeroutput>
				692	.. <computeroutput>%EDI</computeroutput>, and the simulated
				693	flags, <computeroutput>%EFLAGS</computeroutput>.</para>
				694	</listitem>
				695
				696	<listitem>
				697	<para>Another 9 words, holding the V bit "shadows" for the
				698	above 9 regs.</para>
				699	</listitem>
				700
				701	<listitem>
				702	<para>The <command>addresses</command> of various helper
				703	routines called from generated code:
				704	<computeroutput>VG_(helper_value_check4_fail)</computeroutput>,
				705	<computeroutput>VG_(helper_value_check0_fail)</computeroutput>,
				706	which register V-check failures,
				707	<computeroutput>VG_(helperc_STOREV4)</computeroutput>,
				708	<computeroutput>VG_(helperc_STOREV1)</computeroutput>,
				709	<computeroutput>VG_(helperc_LOADV4)</computeroutput>,
				710	<computeroutput>VG_(helperc_LOADV1)</computeroutput>, which
				711	do stores and loads of V bits to/from the sparse array which
				712	keeps track of V bits in memory, and
				713	<computeroutput>VGM_(handle_esp_assignment)</computeroutput>,
				714	which messes with memory addressability resulting from
				715	changes in <computeroutput>%ESP</computeroutput>.</para>
				716	</listitem>
				717
				718	<listitem>
				719	<para>The simulated <computeroutput>%EIP</computeroutput>.</para>
				720	</listitem>
				721
				722	<listitem>
				723	<para>24 spill words, for when the register allocator can't
				724	make it work with 5 measly registers.</para>
				725	</listitem>
				726
				727	<listitem>
				728	<para>Addresses of helpers
				729	<computeroutput>VG_(helperc_STOREV2)</computeroutput>,
				730	<computeroutput>VG_(helperc_LOADV2)</computeroutput>. These
				731	are here because 2-byte loads and stores are relatively rare,
				732	so are placed above the magic 32-word offset boundary.</para>
				733	</listitem>
				734
				735	<listitem>
				736	<para>For similar reasons, addresses of helper functions
				737	<computeroutput>VGM_(fpu_write_check)</computeroutput> and
				738	<computeroutput>VGM_(fpu_read_check)</computeroutput>, which
				739	handle the A/V maps testing and changes required by FPU
				740	writes/reads.</para>
				741	</listitem>
				742
				743	<listitem>
				744	<para>Some other boring helper addresses:
				745	<computeroutput>VG_(helper_value_check2_fail)</computeroutput>
				746	and
				747	<computeroutput>VG_(helper_value_check1_fail)</computeroutput>.
				748	These are probably never emitted now, and should be
				749	removed.</para>
				750	</listitem>
				751
				752	<listitem>
				753	<para>The entire state of the simulated FPU, which I believe
				754	to be 108 bytes long.</para>
				755	</listitem>
				756
				757	<listitem>
				758	<para>Finally, the addresses of various other helper
				759	functions in <filename>vg_helpers.S</filename>, which deal
				760	with rare situations which are tedious or difficult to
				761	generate code in-line for.</para>
				762	</listitem>
				763
				764	</itemizedlist>
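<para>One way to picture the layout listed above is as a bump
allocator handing out word offsets in order. The following is a
hedged sketch, not the code of
<computeroutput>vg_init_baseBlock</computeroutput>: the names
<computeroutput>BlockLayout</computeroutput>,
<computeroutput>alloc_words</computeroutput> and
<computeroutput>lay_out_first_groups</computeroutput> are
hypothetical, and the helper count of seven is simply read off the
third item of the list.</para>
<programlisting><![CDATA[
```c
#include <assert.h>

/* Hypothetical bump allocator over word offsets; not Valgrind's code. */
typedef struct {
    int next_free;   /* next unallocated word offset */
} BlockLayout;

/* Reserve n_words consecutive words; return the offset of the first. */
int alloc_words(BlockLayout *layout, int n_words)
{
    assert(n_words > 0);
    int first = layout->next_free;
    layout->next_free += n_words;
    return first;
}

/* Word offsets of the first five groups in the list above. */
typedef struct {
    int int_regs;      /* 9 words: %EAX .. %EDI plus %EFLAGS */
    int shadow_regs;   /* 9 words of V-bit shadows */
    int helpers;       /* 7 helper addresses named in the list */
    int eip;           /* simulated %EIP */
    int spills;        /* 24 spill words */
} FirstGroups;

FirstGroups lay_out_first_groups(void)
{
    BlockLayout layout = { 0 };
    FirstGroups g;
    g.int_regs    = alloc_words(&layout, 9);
    g.shadow_regs = alloc_words(&layout, 9);
    g.helpers     = alloc_words(&layout, 7);
    g.eip         = alloc_words(&layout, 1);
    g.spills      = alloc_words(&layout, 24);
    return g;
}
```
]]></programlisting>
<para>Under these assumptions the simulated
<computeroutput>%EIP</computeroutput> lands at word 25 and the spill
area starts at word 26, so only the first few spill slots would sit
inside the 32-word short-offset window. The real offsets are whatever
<computeroutput>vg_init_baseBlock</computeroutput> actually
assigns.</para>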
				765
				766	<para>As a general rule, the simulated machine's state lives
				767	permanently in memory at
				768	<computeroutput>VG_(baseBlock)</computeroutput>. However, the
				769	JITter does some optimisations which allow the simulated integer
				770	registers to be cached in real registers over multiple simulated
				771	instructions within the same basic block. These are always
				772	flushed back into memory at the end of every basic block, so that
				773	the in-memory state is up-to-date between basic blocks. (This
				774	flushing is implied by the statement above that the real
				775	machine's allocatable registers are dead in between simulated
				776	blocks).</para>
				777
				778	</sect2>
				779
				780
				781
				782	<sect2 id="mc-tech-docs.startup"
				783	xreflabel="Startup, shutdown, and system calls">
				784	<title>Startup, shutdown, and system calls</title>
				785
				786	<para>Getting into Valgrind
				787	(<computeroutput>VG_(startup)</computeroutput>, called from
				788	<filename>valgrind.so</filename>'s initialisation section),
				789	really means copying the real CPU's state into
				790	<computeroutput>VG_(baseBlock)</computeroutput>, and then
				791	installing our own stack pointer, etc, into the real CPU, and
				792	then starting up the JITter. Exiting Valgrind involves copying
				793	the simulated state back to the real state.</para>
				794
				795	<para>Unfortunately, there's a complication at startup time.
				796	The problem is that at the point where we need to take a snapshot of
				797	the real CPU's state, the offsets in
				798	<computeroutput>VG_(baseBlock)</computeroutput> are not set up
				799	yet, because to do so would involve disrupting the real machine's
				800	state significantly. The way round this is to dump the real
				801	machine's state into a temporary, static block of memory,
				802	<computeroutput>VG_(m_state_static)</computeroutput>. We can
				803	then set up the <computeroutput>VG_(baseBlock)</computeroutput>
				804	offsets at our leisure, and copy into it from
				805	<computeroutput>VG_(m_state_static)</computeroutput> at some
				806	convenient later time. This copying is done by
				807	<computeroutput>VG_(copy_m_state_static_to_baseBlock)</computeroutput>.</para>
				808
				809	<para>On exit, the inverse transformation is (rather
				810	unnecessarily) used: stuff in
				811	<computeroutput>VG_(baseBlock)</computeroutput> is copied to
				812	<computeroutput>VG_(m_state_static)</computeroutput>, and the
				813	assembly stub then copies from
				814	<computeroutput>VG_(m_state_static)</computeroutput> into the
				815	real machine registers.</para>
				816
				817	<para>Doing system calls on behalf of the client
				818	(<filename>vg_syscall.S</filename>) is something of a half-way
				819	house. We have to make the world look sufficiently like that
				820	which the client would normally have to make the syscall actually
				821	work properly, but we can't afford to lose control. So the trick
				822	is to copy all of the client's state, <command>except its program
				823	counter</command>, into the real CPU, do the system call, and
				824	copy the state back out. Note that the client's state includes
				825	its stack pointer register, so one effect of this partial
				826	restoration is to cause the system call to be run on the client's
				827	stack, as it should be.</para>
				828
				829	<para>As ever there are complications. We have to save some of
				830	our own state somewhere when restoring the client's state into
				831	the CPU, so that we can keep going sensibly afterwards. In fact
				832	the only thing which is important is our own stack pointer, but
				833	for paranoia reasons I save and restore our own FPU state as
				834	well, even though that's probably pointless.</para>
				835
				836	<para>The complication on top of the above complication is that, for
				837	horrible reasons to do with signals, we may have to handle a
				838	second client system call whilst the client is blocked inside
				839	some other system call (unbelievable!). That means there are two
				840	sets of places to dump Valgrind's stack pointer and FPU state
				841	across the syscall, and we decide which to use by consulting
				842	<computeroutput>VG_(syscall_depth)</computeroutput>, which is in
				843	turn maintained by
				844	<computeroutput>VG_(wrap_syscall)</computeroutput>.</para>
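<para>The two-sets-of-save-slots idea can be sketched as an array
indexed by nesting depth. This is an illustration under stated
assumptions rather than the code in
<filename>vg_syscall.S</filename>: the names
<computeroutput>SyscallSave</computeroutput>,
<computeroutput>enter_syscall</computeroutput> and
<computeroutput>leave_syscall</computeroutput> are made up, the local
counter merely stands in for
<computeroutput>VG_(syscall_depth)</computeroutput>, and the 108-byte
buffer is the size of the x86 FSAVE area.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

#define MAX_SYSCALL_DEPTH 2   /* two sets of save slots, per the text */

/* Hypothetical save area for Valgrind's own state across a syscall. */
typedef struct {
    uint32_t saved_esp;        /* Valgrind's own stack pointer */
    uint8_t  saved_fpu[108];   /* Valgrind's own FPU state */
} SyscallSave;

static SyscallSave save_slots[MAX_SYSCALL_DEPTH];
static int syscall_depth = 0;  /* stand-in for VG_(syscall_depth) */

/* Enter a (possibly nested) client syscall: choose the save slot for
   the current depth, then note that we are one level deeper. */
SyscallSave *enter_syscall(void)
{
    assert(syscall_depth < MAX_SYSCALL_DEPTH);
    return &save_slots[syscall_depth++];
}

/* Leave the innermost syscall: step back one level and return the
   slot from which our own state should be restored. */
SyscallSave *leave_syscall(void)
{
    assert(syscall_depth > 0);
    return &save_slots[--syscall_depth];
}
```
]]></programlisting>
<para>A nested call therefore gets a different slot from the outer
one, and the slots are released in the reverse order.</para>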
				845
				846	</sect2>
				847
				848
				849
				850	<sect2 id="mc-tech-docs.ucode" xreflabel="Introduction to UCode">
				851	<title>Introduction to UCode</title>
				852
				853	<para>UCode lies at the heart of the x86-to-x86 JITter. The
				854	basic premise is that dealing with the x86 instruction set head-on
				855	is just too darn complicated, so we do the traditional
				856	compiler-writer's trick and translate it into a simpler,
				857	easier-to-deal-with form.</para>
				858
				859	<para>In normal operation, translation proceeds through six
				860	stages, coordinated by
				861	<computeroutput>VG_(translate)</computeroutput>:</para>
				862
				863	<orderedlist>
				864	<listitem>
				865	<para>Parsing of an x86 basic block into a sequence of UCode
				866	instructions (<computeroutput>VG_(disBB)</computeroutput>).</para>
				867	</listitem>
				868
				869	<listitem>
				870	<para>UCode optimisation
				871	(<computeroutput>vg_improve</computeroutput>), with the aim
				872	of caching simulated registers in real registers over
				873	multiple simulated instructions, and removing redundant
				874	simulated <computeroutput>%EFLAGS</computeroutput>
				875	saving/restoring.</para>
				876	</listitem>
				877
				878	<listitem>
				879	<para>UCode instrumentation
				880	(<computeroutput>vg_instrument</computeroutput>), which adds
				881	value and address checking code.</para>
				882	</listitem>
				883
				884	<listitem>
				885	<para>Post-instrumentation cleanup
				886	(<computeroutput>vg_cleanup</computeroutput>), removing
				887	redundant value-check computations.</para>
				888	</listitem>
				889
				890	<listitem>
				891	<para>Register allocation
				892	(<computeroutput>vg_do_register_allocation</computeroutput>),
				893	which, note, is done on UCode.</para>
				894	</listitem>
				895
				896	<listitem>
				897	<para>Emission of final instrumented x86 code
				898	(<computeroutput>VG_(emit_code)</computeroutput>).</para>
				899	</listitem>
				900
				901	</orderedlist>
				902
				903	<para>Notice how steps 2, 3, 4 and 5 are simple UCode-to-UCode
				904	transformation passes, all on straight-line blocks of UCode (type
				905	<computeroutput>UCodeBlock</computeroutput>). Steps 2 and 4 are
				906	optimisation passes and can be disabled for debugging purposes,
				907	with <computeroutput>--optimise=no</computeroutput> and
				908	<computeroutput>--cleanup=no</computeroutput> respectively.</para>
				909
				910	<para>Valgrind can also run in a no-instrumentation mode, given
				911	<computeroutput>--instrument=no</computeroutput>. This is useful
				912	for debugging the JITter quickly without having to deal with the
				913	complexity of the instrumentation mechanism too. In this mode,
				914	steps 3 and 4 are omitted.</para>
				915
				916	<para>These flags combine, so that
				917	<computeroutput>--instrument=no</computeroutput> together with
				918	<computeroutput>--optimise=no</computeroutput> means only steps
				919	1, 5 and 6 are used.
				920	<computeroutput>--single-step=yes</computeroutput> causes each
				921	x86 instruction to be treated as a single basic block. The
				922	translations are terrible but this is sometimes instructive.</para>
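<para>The way these flags select steps can be summarised in a few
lines of C. This is a sketch of the rules as stated above, not of
<computeroutput>VG_(translate)</computeroutput> itself; the function
<computeroutput>steps_enabled</computeroutput> and the
<computeroutput>STEP_</computeroutput> constants are hypothetical
names.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>

/* Bit (i-1) set means step i of the six-step pipeline runs. */
enum {
    STEP_PARSE      = 1 << 0,  /* 1: x86 basic block -> UCode */
    STEP_IMPROVE    = 1 << 1,  /* 2: UCode optimisation */
    STEP_INSTRUMENT = 1 << 2,  /* 3: add value/address checking code */
    STEP_CLEANUP    = 1 << 3,  /* 4: post-instrumentation cleanup */
    STEP_REGALLOC   = 1 << 4,  /* 5: register allocation */
    STEP_EMIT       = 1 << 5   /* 6: emit final x86 code */
};

/* Hypothetical helper: which steps run for a given combination of
   --optimise, --cleanup and --instrument. */
unsigned steps_enabled(bool optimise, bool cleanup, bool instrument)
{
    /* Steps 1, 5 and 6 always run. */
    unsigned steps = STEP_PARSE | STEP_REGALLOC | STEP_EMIT;
    if (optimise)
        steps |= STEP_IMPROVE;
    if (instrument) {
        steps |= STEP_INSTRUMENT;
        if (cleanup)           /* step 4 is omitted without step 3 */
            steps |= STEP_CLEANUP;
    }
    return steps;
}
```
]]></programlisting>
<para>For example,
<computeroutput>steps_enabled(false, true, false)</computeroutput>
yields only the bits for steps 1, 5 and 6.</para>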
				923
				924	<para>The <computeroutput>--stop-after=N</computeroutput> flag
				925	switches back to the real CPU after
				926	<computeroutput>N</computeroutput> basic blocks. It also re-JITs
				927	the final basic block executed and prints the debugging info
				928	resulting, so this gives you a way to get a quick snapshot of how
				929	a basic block looks as it passes through the six stages mentioned
				930	above. If you want to see full information for every block
				931	translated (probably not, but still ...) find, in
				932	<computeroutput>VG_(translate)</computeroutput>, the lines</para>
				933	<programlisting><![CDATA[
				934	dis = True;
				935	dis = debugging_translation;]]></programlisting>
				936
				937	<para>and comment out the second line. This will spew out
				938	debugging junk faster than you can possibly imagine.</para>
				939
				940	</sect2>
				941
				942
				943
				944	<sect2 id="mc-tech-docs.tags" xreflabel="UCode operand tags: type 'Tag'">
				945	<title>UCode operand tags: type <computeroutput>Tag</computeroutput></title>
				946
				947	<para>UCode is, more or less, a simple two-address RISC-like
				948	code. In keeping with the x86 AT&amp;T assembly syntax,
				949	generally speaking the first operand is the source operand, and
				950	the second is the destination operand, which is modified when the
				951	uinstr is notionally executed.</para>
				952
				953	<para>UCode instructions have up to three operand fields, each of
				954	which has a corresponding <computeroutput>Tag</computeroutput>
				955	describing it. Possible values for the tag are:</para>
				956
				957	<itemizedlist>
				958
				959	<listitem>
				960	<para><computeroutput>NoValue</computeroutput>: indicates
				961	that the field is not in use.</para>
				962	</listitem>
				963
				964	<listitem>
				965	<para><computeroutput>Lit16</computeroutput>: the field
				966	contains a 16-bit literal.</para>
				967	</listitem>
				968
				969	<listitem>
				970	<para><computeroutput>Literal</computeroutput>: the field
				971	denotes a 32-bit literal, whose value is stored in the
				972	<computeroutput>lit32</computeroutput> field of the uinstr
				973	itself. Since there is only one
				974	<computeroutput>lit32</computeroutput> for the whole uinstr,
				975	only one operand field may contain this tag.</para>
				976	</listitem>
				977
				978	<listitem>
				979	<para><computeroutput>SpillNo</computeroutput>: the field
				980	contains a spill slot number, in the range 0 to 23 inclusive,
				981	denoting one of the spill slots contained inside
				982	<computeroutput>VG_(baseBlock)</computeroutput>. Such tags
				983	only exist after register allocation.</para>
				984	</listitem>
				985
				986	<listitem>
				987	<para><computeroutput>RealReg</computeroutput>: the field
				988	contains a number in the range 0 to 7 denoting an integer x86
				989	("real") register on the host. The number is the Intel
				990	encoding for integer registers. Such tags only exist after
				991	register allocation.</para>
				992	</listitem>
				993
				994	<listitem>
				995	<para><computeroutput>ArchReg</computeroutput>: the field
				996	contains a number in the range 0 to 7 denoting an integer x86
				997	register on the simulated CPU. In reality this means a
				998	reference to one of the first 8 words of
				999	<computeroutput>VG_(baseBlock)</computeroutput>. Such tags
				1000	can exist at any point in the translation process.</para>
				1001	</listitem>
				1002
				1003	<listitem>
				1004	<para>Last, but not least,
				1005	<computeroutput>TempReg</computeroutput>. The field contains
				1006	the number of one of an infinite set of virtual (integer)
				1007	registers. <computeroutput>TempReg</computeroutput>s are used
				1008	everywhere throughout the translation process; you can have
				1009	as many as you want. The register allocator maps as many as
				1010	it can into <computeroutput>RealReg</computeroutput>s and
				1011	turns the rest into
				1012	<computeroutput>SpillNo</computeroutput>s, so
				1013	<computeroutput>TempReg</computeroutput>s should not exist
				1014	after the register allocation phase.</para>
				1015
				1016	<para><computeroutput>TempReg</computeroutput>s are always 32
				1017	bits long, even if the data they hold is logically shorter.
				1018	In that case the upper unused bits are required, and, I
				1019	think, generally assumed, to be zero.
				1020	<computeroutput>TempReg</computeroutput>s holding V bits for
				1021	quantities shorter than 32 bits are expected to have ones in
				1022	the unused places, since a one denotes "undefined".</para>
				1023	</listitem>
				1024
				1025	</itemizedlist>
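<para>The constraints in the list above can be written down as a few
predicates. This is a sketch for orientation only: the enumerator
order is an assumption, and the three helper functions are
hypothetical and are not part of Valgrind's sanity checkers.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative only: the numeric values of the tags are assumed. */
typedef enum {
    NoValue, Lit16, Literal, SpillNo, RealReg, ArchReg, TempReg
} Tag;

/* SpillNo and RealReg only exist after register allocation. */
bool tag_ok_before_regalloc(Tag t)
{
    return t != SpillNo && t != RealReg;
}

/* TempRegs should all be gone once register allocation is done. */
bool tag_ok_after_regalloc(Tag t)
{
    return t != TempReg;
}

/* Range check for the operand value that accompanies a tag. */
bool tag_value_in_range(Tag t, unsigned value)
{
    switch (t) {
    case SpillNo: return value <= 23;      /* 24 spill slots */
    case RealReg:
    case ArchReg: return value <= 7;       /* Intel encoding 0..7 */
    case Lit16:   return value <= 0xFFFF;  /* 16-bit literal */
    default:      return true;             /* no range stated above */
    }
}
```
]]></programlisting>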
				1026
				1027	</sect2>
				1028
				1029
				1030
				1031	<sect2 id="mc-tech-docs.uinstr"
				1032	xreflabel="UCode instructions: type 'UInstr'">
				1033	<title>UCode instructions: type <computeroutput>UInstr</computeroutput></title>
				1034
				1035	<para>UCode was carefully designed to make it possible to do
				1036	register allocation on UCode and then translate the result into
				1037	x86 code without needing any extra registers ... well, that was
				1038	the original plan, anyway. Things have gotten a little more
				1039	complicated since then. In what follows, UCode instructions are
				1040	referred to as uinstrs, to distinguish them from x86
				1041	instructions. Uinstrs of course have uopcodes which are
				1042	(naturally) different from x86 opcodes.</para>
				1043
				1044	<para>A uinstr (type <computeroutput>UInstr</computeroutput>)
				1045	contains various fields, not all of which are used by any one
				1046	uopcode:</para>
				1047
				1048	<itemizedlist>
				1049
				1050	<listitem>
				1051	<para>Three 16-bit operand fields,
				1052	<computeroutput>val1</computeroutput>,
				1053	<computeroutput>val2</computeroutput> and
				1054	<computeroutput>val3</computeroutput>.</para>
				1055	</listitem>
				1056
				1057	<listitem>
				1058	<para>Three tag fields,
				1059	<computeroutput>tag1</computeroutput>,
				1060	<computeroutput>tag2</computeroutput> and
				1061	<computeroutput>tag3</computeroutput>. Each of these has a
				1062	value of type <computeroutput>Tag</computeroutput>, and they
				1063	describe what the <computeroutput>val1</computeroutput>,
				1064	<computeroutput>val2</computeroutput> and
				1065	<computeroutput>val3</computeroutput> fields contain.</para>
				1066	</listitem>
				1067
				1068	<listitem>
				1069	<para>A 32-bit literal field.</para>
				1070	</listitem>
				1071
				1072	<listitem>
				1073	<para>Two <computeroutput>FlagSet</computeroutput>s,
				1074	specifying which x86 condition codes are read and written by
				1075	the uinstr.</para>
				1076	</listitem>
				1077
				1078	<listitem>
				1079	<para>An opcode byte, containing a value of type
				1080	<computeroutput>Opcode</computeroutput>.</para>
				1081	</listitem>
				1082
				1083	<listitem>
				1084	<para>A size field, indicating the data transfer size
				1085	(1/2/4/8/10) in cases where this makes sense, or zero
				1086	otherwise.</para>
				1087	</listitem>
				1088
				1089	<listitem>
				1090	<para>A condition-code field, which, for jumps, holds a value
				1091	of type <computeroutput>Condcode</computeroutput>, indicating
				1092	the condition which applies. The encoding is as it is in the
				1093	x86 insn stream, except we add a 17th value
				1094	<computeroutput>CondAlways</computeroutput> to indicate an
				1095	unconditional transfer.</para>
				1096	</listitem>
				1097
				1098	<listitem>
				1099	<para>Various 1-bit flags, indicating whether this insn
				1100	pertains to an x86 CALL or RET instruction, whether a
				1101	widening is signed or not, etc.</para>
				1102	</listitem>
				1103
				1104	</itemizedlist>
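<para>As a rough picture of the fields just listed, consider the
following C sketch. It is not the definition of
<computeroutput>UInstr</computeroutput> from the sources: apart from
<computeroutput>val1</computeroutput> ..
<computeroutput>val3</computeroutput>,
<computeroutput>tag1</computeroutput> ..
<computeroutput>tag3</computeroutput> and
<computeroutput>lit32</computeroutput>, every field name and width
here is an assumption, and the two checking functions are
hypothetical.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: the numeric values of the tags are assumed. */
typedef enum {
    NoValue, Lit16, Literal, SpillNo, RealReg, ArchReg, TempReg
} Tag;

/* Sketch of a uinstr; names and widths are assumptions unless the
   text above states them. */
typedef struct {
    uint32_t lit32;              /* the single 32-bit literal */
    uint16_t val1, val2, val3;   /* three 16-bit operand fields */
    uint8_t  tag1, tag2, tag3;   /* each holds a Tag */
    uint8_t  flags_r, flags_w;   /* FlagSets: condition codes read/written */
    uint8_t  opcode;             /* a value of type Opcode */
    uint8_t  size;               /* 1/2/4/8/10, or 0 if not meaningful */
    uint8_t  cond;               /* Condcode, meaningful for jumps */
    unsigned signed_widen : 1;   /* example 1-bit flags */
    unsigned jmp_is_call  : 1;
    unsigned jmp_is_ret   : 1;
} UInstrSketch;

/* There is only one lit32, so at most one operand may be Literal. */
int literal_tag_count(const UInstrSketch *u)
{
    return (u->tag1 == Literal) + (u->tag2 == Literal)
         + (u->tag3 == Literal);
}

/* Check the size field against the values listed above. */
int size_is_valid(const UInstrSketch *u)
{
    switch (u->size) {
    case 0: case 1: case 2: case 4: case 8: case 10:
        return 1;
    default:
        return 0;
    }
}
```
]]></programlisting>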
				1105
				1106	<para>UOpcodes (type <computeroutput>Opcode</computeroutput>) are
				1107	divided into two groups: those necessary merely to express the
				1108	functionality of the x86 code, and extra uopcodes needed to
				1109	express the instrumentation. The former group contains:</para>
				1110
				1111	<itemizedlist>
				1112
				1113	<listitem>
				1114	<para><computeroutput>GET</computeroutput> and
				1115	<computeroutput>PUT</computeroutput>, which move values from
				1116	the simulated CPU's integer registers
				1117	(<computeroutput>ArchReg</computeroutput>s) into
				1118	<computeroutput>TempReg</computeroutput>s, and back.
				1119	<computeroutput>GETF</computeroutput> and
				1120	<computeroutput>PUTF</computeroutput> do the corresponding
				1121	thing for the simulated
				1122	<computeroutput>%EFLAGS</computeroutput>. There are no
				1123	corresponding insns for the FPU register stack, since we
				1124	don't explicitly simulate its registers.</para>
				1125	</listitem>
				1126
				1127	<listitem>
				1128	<para><computeroutput>LOAD</computeroutput> and
				1129	<computeroutput>STORE</computeroutput>, which, in RISC-like
				1130	fashion, are the only uinstrs able to interact with
				1131	memory.</para>
				1132	</listitem>
				1133
				1134	<listitem>
				1135	<para><computeroutput>MOV</computeroutput> and
				1136	<computeroutput>CMOV</computeroutput> allow unconditional and
				1137	conditional moves of values between
				1138	<computeroutput>TempReg</computeroutput>s.</para>
				1139	</listitem>
				1140
				1141	<listitem>
				1142	<para>ALU operations. Again in RISC-like fashion, these only
				1143	operate on <computeroutput>TempReg</computeroutput>s (before
				1144	reg-alloc) or <computeroutput>RealReg</computeroutput>s
				1145	(after reg-alloc). These are:
				1146	<computeroutput>ADD</computeroutput>,
				1147	<computeroutput>ADC</computeroutput>,
				1148	<computeroutput>AND</computeroutput>,
				1149	<computeroutput>OR</computeroutput>,
				1150	<computeroutput>XOR</computeroutput>,
				1151	<computeroutput>SUB</computeroutput>,
				1152	<computeroutput>SBB</computeroutput>,
				1153	<computeroutput>SHL</computeroutput>,
				1154	<computeroutput>SHR</computeroutput>,
				1155	<computeroutput>SAR</computeroutput>,
				1156	<computeroutput>ROL</computeroutput>,
				1157	<computeroutput>ROR</computeroutput>,
				1158	<computeroutput>RCL</computeroutput>,
				1159	<computeroutput>RCR</computeroutput>,
				1160	<computeroutput>NOT</computeroutput>,
				1161	<computeroutput>NEG</computeroutput>,
				1162	<computeroutput>INC</computeroutput>,
				1163	<computeroutput>DEC</computeroutput>,
				1164	<computeroutput>BSWAP</computeroutput>,
				1165	<computeroutput>CC2VAL</computeroutput> and
				1166	<computeroutput>WIDEN</computeroutput>.
				1167	<computeroutput>WIDEN</computeroutput> does signed or
				1168	unsigned value widening.
				1169	<computeroutput>CC2VAL</computeroutput> is used to convert
				1170	condition codes into a value, zero or one. The rest are
				1171	obvious.</para>
				1172
				1173	<para>To allow for more efficient code generation, we bend
				1174	slightly the restriction at the start of the previous para:
				1175	for <computeroutput>ADD</computeroutput>,
				1176	<computeroutput>ADC</computeroutput>,
				1177	<computeroutput>XOR</computeroutput>,
				1178	<computeroutput>SUB</computeroutput> and
				1179	<computeroutput>SBB</computeroutput>, we allow the first
				1180	(source) operand to also be an
				1181	<computeroutput>ArchReg</computeroutput>, that is, one of the
				1182	simulated machine's registers. Also, many of these ALU ops
				1183	allow the source operand to be a literal. See
				1184	<computeroutput>VG_(saneUInstr)</computeroutput> for the
				1185	final word on the allowable forms of uinstrs.</para>
				1186	</listitem>
				1187	necessary, but facilitate better translations. They
				1188	<listitem>
				1189	<para><computeroutput>LEA1</computeroutput> and
				1190	<computeroutput>LEA2</computeroutput> are not strictly
				1191	necessary, but allow faciliate better translations. They
				1192	record the fancy x86 addressing modes in a direct way, which
				1193	allows those amodes to be emitted back into the final
				1194	instruction stream more or less verbatim.</para>
				1195	</listitem>
				1196
				1197	<listitem>
				1198	<para><computeroutput>CALLM</computeroutput> calls a
				1199	machine-code helper, one of the methods whose address is
				1200	stored at some
				1201	<computeroutput>VG_(baseBlock)</computeroutput> offset.
				1202	<computeroutput>PUSH</computeroutput> and
				1203	<computeroutput>POP</computeroutput> move values between
				1204	<computeroutput>TempReg</computeroutput>s and the real
				1205	(Valgrind's) stack, and
				1206	<computeroutput>CLEAR</computeroutput> removes values from
				1207	the stack. <computeroutput>CALLM_S</computeroutput> and
				1208	<computeroutput>CALLM_E</computeroutput> delimit the
				1209	boundaries of call setups and clearings, for the benefit of
				1210	the instrumentation passes. Getting this right is critical,
				1211	and so <computeroutput>VG_(saneUCodeBlock)</computeroutput>
				1212	makes various checks on the use of these uopcodes.</para>
				1213
				1214	<para>It is important to understand that these uopcodes have
				1215	nothing to do with the x86
				1216	<computeroutput>call</computeroutput>,
				1217	<computeroutput>ret</computeroutput>,
				1218	<computeroutput>push</computeroutput> or
				1219	<computeroutput>pop</computeroutput> instructions, and are
				1220	not used to implement them. Those guys turn into
				1221	combinations of <computeroutput>GET</computeroutput>,
				1222	<computeroutput>PUT</computeroutput>,
				1223	<computeroutput>LOAD</computeroutput>,
				1224	<computeroutput>STORE</computeroutput>,
				1225	<computeroutput>ADD</computeroutput>,
				1226	<computeroutput>SUB</computeroutput>, and
				1227	<computeroutput>JMP</computeroutput>. What these uopcodes
				1228	support is calling of helper functions such as
				1229	<computeroutput>VG_(helper_imul_32_64)</computeroutput>,
				1230	which do stuff which is too difficult or tedious to emit
				1231	inline.</para>
				1232	</listitem>
				1233
				1234	<listitem>
				1235	<para><computeroutput>FPU</computeroutput>,
				1236	<computeroutput>FPU_R</computeroutput> and
				1237	<computeroutput>FPU_W</computeroutput>. Valgrind doesn't
				1238	attempt to simulate the internal state of the FPU at all.
				1239	Consequently it only needs to be able to distinguish FPU ops
				1240	which read and write memory from those that don't, and for
				1241	those which do, it needs to know the effective address and
				1242	data transfer size. This is made easier because the x86 FP
				1243	instruction encoding is very regular, basically consisting of
				1244	16 bits for a non-memory FPU insn and 11 (IIRC) bits + an
				1245	address mode for a memory FPU insn. So our
				1246	<computeroutput>FPU</computeroutput> uinstr carries the 16
				1247	bits in its <computeroutput>val1</computeroutput> field. And
				1248	<computeroutput>FPU_R</computeroutput> and
				1249	<computeroutput>FPU_W</computeroutput> carry 11 bits in that
				1250	field, together with the identity of a
				1251	<computeroutput>TempReg</computeroutput> or (later)
				1252	<computeroutput>RealReg</computeroutput> which contains the
				1253	address.</para>
				1254	</listitem>
				1255
				1256	<listitem>
				1257	<para><computeroutput>JIFZ</computeroutput> is unique, in
				1258	that it allows a control-flow transfer which is not deemed to
				1259	end a basic block. It causes a jump to a literal (original)
				1260	address if the specified argument is zero.</para>
				1261	</listitem>
				1262
				1263	<listitem>
				1264	<para>Finally, <computeroutput>INCEIP</computeroutput>
				1265	advances the simulated <computeroutput>%EIP</computeroutput>
				1266	by the specified literal amount. This supports lazy
				1267	<computeroutput>%EIP</computeroutput> updating, as described
				1268	below.</para>
				1269	</listitem>
				1270
				1271	</itemizedlist>
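<para>Of the ALU uopcodes in the list above,
<computeroutput>WIDEN</computeroutput> is perhaps the least
self-explanatory, so here is a minimal C model of signed and unsigned
widening to 32 bits. The function name
<computeroutput>widen</computeroutput> and its parameters are
invented for this sketch; it models the arithmetic only, not the way
the uinstr is encoded or emitted.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Widen the low from_size bytes (1 or 2) of value to 32 bits, either
   sign-extending or zero-extending. */
uint32_t widen(uint32_t value, int from_size, int is_signed)
{
    assert(from_size == 1 || from_size == 2);
    uint32_t low_mask = (from_size == 1) ? 0xFFu : 0xFFFFu;
    uint32_t sign_bit = (from_size == 1) ? 0x80u : 0x8000u;
    uint32_t result = value & low_mask;   /* discard the upper bits */
    if (is_signed && (result & sign_bit))
        result |= ~low_mask;              /* replicate the sign bit upwards */
    return result;
}
```
]]></programlisting>
<para>So <computeroutput>widen(0x80, 1, 1)</computeroutput> is
<computeroutput>0xFFFFFF80</computeroutput>, whereas
<computeroutput>widen(0x80, 1, 0)</computeroutput> is
<computeroutput>0x80</computeroutput>.</para>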
				1272
				1273	<para>Stages 1 and 2 of the 6-stage translation process mentioned
				1274	above deal purely with these uopcodes, and no others. They are
				1275	sufficient to express pretty much all the x86 32-bit
				1276	protected-mode instruction set, at least everything understood by
				1277	a pre-MMX original Pentium (P54C).</para>
				1278
				1279	<para>Stages 3, 4, 5 and 6 also deal with the following extra
				1280	"instrumentation" uopcodes. They are used to express all the
				1281	definedness-tracking and -checking machinery which Valgrind does.
				1282	In later sections we show how to create checking code for each of
				1283	the uopcodes above. Note that these instrumentation uopcodes,
				1284	although some appearing complicated, have been carefully chosen
				1285	so that efficient x86 code can be generated for them. GNU
				1286	superopt v2.5 did a great job helping out here. Anyways, the
				1287	uopcodes are as follows:</para>
				1288
				1289	<itemizedlist>
				1290
				1291	<listitem>
				1292	<para><computeroutput>GETV</computeroutput> and
				1293	<computeroutput>PUTV</computeroutput> are analogues of
				1294	<computeroutput>GET</computeroutput> and
				1295	<computeroutput>PUT</computeroutput> above. They are
				1296	identical except that they move the V bits for the specified
				1297	values back and forth to
				1298	<computeroutput>TempReg</computeroutput>s, rather than moving
				1299	the values themselves.</para>
				1300	</listitem>
				1301
				1302	<listitem>
				1303	<para>Similarly, <computeroutput>LOADV</computeroutput> and
				1304	<computeroutput>STOREV</computeroutput> read and write V bits
				1305	from the synthesised shadow memory that Valgrind maintains.
				1306	In fact they do more than that, since they also do
				1307	address-validity checks, and emit complaints if the
				1308	read/written addresses are unaddressable.</para>
				1309	</listitem>
				1310
				1311	<listitem>
				1312	<para><computeroutput>TESTV</computeroutput>, whose
				1313	parameters are a <computeroutput>TempReg</computeroutput> and
				1314	a size, tests the V bits in the
				1315	<computeroutput>TempReg</computeroutput>, at the specified
				1316	operation size (0/1/2/4 byte) and emits an error if any of
				1317	them indicate undefinedness. This is the only uopcode
				1318	capable of doing such tests.</para>
				1319	</listitem>
				1320
				1321	<listitem>
				1322	<para><computeroutput>SETV</computeroutput>, whose parameters
				1323	are also <computeroutput>TempReg</computeroutput> and a size,
				1324	makes the V bits in the
				1325	<computeroutput>TempReg</computeroutput> indicate
				1326	definedness, at the specified operation size. This is
				1327	usually used to generate the correct V bits for a literal
				1328	value, which is of course fully defined.</para>
				1329	</listitem>
				1330
				1331	<listitem>
				1332	<para><computeroutput>GETVF</computeroutput> and
				1333	<computeroutput>PUTVF</computeroutput> are analogues of
				1334	<computeroutput>GETF</computeroutput> and
				1335	<computeroutput>PUTF</computeroutput>. They move the single
				1336	V bit used to model definedness of
				1337	<computeroutput>%EFLAGS</computeroutput> between its home in
				1338	<computeroutput>VG_(baseBlock)</computeroutput> and the
				1339	specified <computeroutput>TempReg</computeroutput>.</para>
				1340	</listitem>
				1341
				1342	<listitem>
				1343	<para><computeroutput>TAG1</computeroutput> denotes one of a
				1344	family of unary operations on
				1345	<computeroutput>TempReg</computeroutput>s containing V bits.
				1346	Similarly, <computeroutput>TAG2</computeroutput> denotes one
				1347	in a family of binary operations on V bits.</para>
				1348	</listitem>
				1349
				1350	</itemizedlist>
				1351
				1352
				1353	<para>These 10 uopcodes are sufficient to express Valgrind's
				1354	entire definedness-checking semantics. In fact most of the
				1355	interesting magic is done by the
				1356	<computeroutput>TAG1</computeroutput> and
				1357	<computeroutput>TAG2</computeroutput> suboperations.</para>
				1358
				1359	<para>First, however, I need to explain about V-vector operation
				1360	sizes. There are 4 sizes: 1, 2 and 4, which operate on groups of
				1361	8, 16 and 32 V bits at a time, supporting the usual 1, 2 and 4
				1362	byte x86 operations. However there is also the mysterious size
				1363	0, which really means a single V bit. Single V bits are used in
				1364	various circumstances; in particular, the definedness of
				1365	<computeroutput>%EFLAGS</computeroutput> is modelled with a
				1366	single V bit. Now might be a good time to also point out that
				1367	for V bits, 1 means "undefined" and 0 means "defined".
				1368	Similarly, for A bits, 1 means "invalid address" and 0 means
				1369	"valid address". This seems counterintuitive (and so it is), but
				1370	testing against zero on x86s saves instructions compared to
				1371	testing against all 1s, because many ALU operations set the Z
				1372	flag for free, so to speak.</para>
				1373
				1374	<para>With that in mind, the tag ops are:</para>
				1375
				1376	<itemizedlist>
				1377
				1378	<listitem>
				1379	<formalpara>
				1380	<title>(UNARY) Pessimising casts:</title>
				1381	<para><computeroutput>VgT_PCast40</computeroutput>,
				1382	<computeroutput>VgT_PCast20</computeroutput>,
				1383	<computeroutput>VgT_PCast10</computeroutput>,
				1384	<computeroutput>VgT_PCast01</computeroutput>,
				1385	<computeroutput>VgT_PCast02</computeroutput> and
				1386	<computeroutput>VgT_PCast04</computeroutput>. A "pessimising
				1387	cast" takes a V-bit vector at one size, and creates a new one
				1388	at another size, pessimised in the sense that if any of the
				1389	bits in the source vector indicate undefinedness, then all
				1390	the bits in the result indicate undefinedness. In this case
				1391	the casts are all to or from a single V bit, so for example
				1392	<computeroutput>VgT_PCast40</computeroutput> is a pessimising
				1393	cast from 32 bits to 1, whereas
				1394	<computeroutput>VgT_PCast04</computeroutput> simply copies
				1395	the single source V bit into all 32 bit positions in the
				1396	result. Surprisingly, these ops can all be implemented very
				1397	efficiently.</para>
				1398	</formalpara>
				1399
				1400	<para>There are also the pessimising casts
				1401	<computeroutput>VgT_PCast14</computeroutput>, from 8 bits to
				1402	32, <computeroutput>VgT_PCast12</computeroutput>, from 8 bits
				1403	to 16, and <computeroutput>VgT_PCast11</computeroutput>, from
				1404	8 bits to 8. This last one seems nonsensical, but in fact it
				1405	isn't a no-op because, as mentioned above, any undefined (1)
				1406	bits in the source infect the entire result.</para>
				1407	</listitem>
				1408
				1409	<listitem>
				1410	<formalpara>
				1411	<title>(UNARY) Propagating undefinedness upwards in a
				1412	word:</title>
				1413	<para><computeroutput>VgT_Left4</computeroutput>,
				1414	<computeroutput>VgT_Left2</computeroutput> and
				1415	<computeroutput>VgT_Left1</computeroutput>. These are used
				1416	to simulate the worst-case effects of carry propagation in
				1417	adds and subtracts. They return a V vector identical to the
				1418	original, except that if the original contained any undefined
				1419	bits, then the lowest such bit and all bits above it are marked
				1420	as undefined too. Hence the Left bit in the names.</para></formalpara>
				1421	</listitem>
				1422
				1423	<listitem>
				1424	<formalpara>
				1425	<title>(UNARY) Signed and unsigned value widening:</title>
				1426	<para><computeroutput>VgT_SWiden14</computeroutput>,
				1427	<computeroutput>VgT_SWiden24</computeroutput>,
				1428	<computeroutput>VgT_SWiden12</computeroutput>,
				1429	<computeroutput>VgT_ZWiden14</computeroutput>,
				1430	<computeroutput>VgT_ZWiden24</computeroutput> and
				1431	<computeroutput>VgT_ZWiden12</computeroutput>. These mimic
				1432	the definedness effects of standard signed and unsigned
				1433	integer widening. Unsigned widening creates zero bits in the
				1434	new positions, so
				1435	<computeroutput>VgT_ZWiden*</computeroutput> accordingly
				1436	mark those parts of their argument as defined. Signed
				1437	widening copies the sign bit into the new positions, so
				1438	<computeroutput>VgT_SWiden*</computeroutput> copies the
				1439	definedness of the sign bit into the new positions. Because
				1440	1 means undefined and 0 means defined, these operations can
				1441	(fascinatingly) be done by the same operations which they
				1442	mimic. Go figure.</para>
				1443	</formalpara>
				1444	</listitem>
				1445
				1446	<listitem>
				1447	<formalpara>
				1448	<title>(BINARY) Undefined-if-either-Undefined,
				1449	Defined-if-either-Defined:</title>
				1450	<para><computeroutput>VgT_UifU4</computeroutput>,
				1451	<computeroutput>VgT_UifU2</computeroutput>,
				1452	<computeroutput>VgT_UifU1</computeroutput>,
				1453	<computeroutput>VgT_UifU0</computeroutput>,
				1454	<computeroutput>VgT_DifD4</computeroutput>,
				1455	<computeroutput>VgT_DifD2</computeroutput>,
				1456	<computeroutput>VgT_DifD1</computeroutput>. These do simple
				1457	bitwise operations on pairs of V-bit vectors, with
				1458	<computeroutput>UifU</computeroutput> giving undefined if
				1459	either arg bit is undefined, and
				1460	<computeroutput>DifD</computeroutput> giving defined if
				1461	either arg bit is defined. Abstract interpretation junkies,
				1462	if any make it this far, may like to think of them as meets
				1463	and joins (or is it joins and meets) in the definedness
				1464	lattices.</para>
				1465	</formalpara>
				1466	</listitem>
				1467
				1468	<listitem>
				1469	<formalpara>
				1470	<title>(BINARY; one value, one V bits) Generate argument
				1471	improvement terms for AND and OR</title>
				1472	<para><computeroutput>VgT_ImproveAND4_TQ</computeroutput>,
				1473	<computeroutput>VgT_ImproveAND2_TQ</computeroutput>,
				1474	<computeroutput>VgT_ImproveAND1_TQ</computeroutput>,
				1475	<computeroutput>VgT_ImproveOR4_TQ</computeroutput>,
				1476	<computeroutput>VgT_ImproveOR2_TQ</computeroutput>,
				1477	<computeroutput>VgT_ImproveOR1_TQ</computeroutput>. These
				1478	help out with AND and OR operations. AND and OR have the
				1479	inconvenient property that the definedness of the result
				1480	depends on the actual values of the arguments as well as
				1481	their definedness. At the bit level:</para></formalpara>
				1482	<programlisting><![CDATA[
				1483	1 AND undefined = undefined, but
				1484	0 AND undefined = 0, and
				1485	similarly
				1486	0 OR undefined = undefined, but
				1487	1 OR undefined = 1.]]></programlisting>
				1488
				1489	<para>It turns out that gcc (quite legitimately) generates
				1490	code which relies on this fact, so we have to model it
				1491	properly in order to avoid flooding users with spurious value
				1492	errors. The ultimate definedness result of AND and OR is
				1493	calculated using <computeroutput>UifU</computeroutput> on the
				1494	definedness of the arguments, but we also
				1495	<computeroutput>DifD</computeroutput> in some "improvement"
				1496	terms which take into account the above phenomena.</para>
				1497
				1498	<para><computeroutput>ImproveAND</computeroutput> takes as
				1499	its arguments the actual value of an argument to AND
				1500	(the T) and the definedness of that argument (the Q), and
				1501	returns a V-bit vector which is defined (0) for bits which
				1502	have value 0 and are defined; this, when
				1503	<computeroutput>DifD</computeroutput>ed into the final result,
				1504	causes those bits to be defined even if the corresponding bit
				1505	in the other argument is undefined.</para>
				1506
				1507	<para>The <computeroutput>ImproveOR</computeroutput> ops do
				1508	the dual thing for OR arguments. Note that XOR does not have
				1509	this property that one argument can make the other
				1510	irrelevant, so there is no need for such complexity for
				1511	XOR.</para>
				1512	</listitem>
				1513
				1514	</itemizedlist>
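<para>As a rough illustration, the tag ops above can be modelled
on plain integers. The following Python sketch is not Valgrind's
implementation (which generates x86 code for these ops); the
function names and the <computeroutput>MASK</computeroutput> table
are assumptions made purely for this example. V bits follow the
convention described above: 1 means undefined, 0 means
defined.</para>
<programlisting><![CDATA[
```python
# Toy model of the tag ops on Python ints; 1 = undefined, 0 = defined.
# All names here are illustrative, not Valgrind's.
MASK = {0: 0x1, 1: 0xFF, 2: 0xFFFF, 4: 0xFFFFFFFF}  # size -> V-bit mask

def pcast(v, src, dst):
    """Pessimising cast: any undefined source bit makes every result bit undefined."""
    return MASK[dst] if (v & MASK[src]) else 0

def left(v, size):
    """Mark the lowest undefined bit and all bits above it as undefined."""
    m = MASK[size]
    v &= m
    return (v | -v) & m

def zwiden(v, src, dst):
    """Unsigned widening: the new high bits are zero, hence defined."""
    return v & MASK[src]

def swiden(v, src, dst):
    """Signed widening: copy the definedness of the sign bit upwards."""
    v &= MASK[src]
    sign = 1 << (8 * src - 1)
    if v & sign:
        v |= MASK[dst] & ~MASK[src]
    return v

def uifu(a, b, size):
    """Undefined if either bit is undefined: bitwise OR."""
    return (a | b) & MASK[size]

def difd(a, b, size):
    """Defined if either bit is defined: bitwise AND."""
    return (a & b) & MASK[size]

def improve_and(t, q, size):
    """Defined (0) exactly where the value bit is 0 and is itself defined."""
    return (t | q) & MASK[size]

def improve_or(t, q, size):
    """Defined (0) exactly where the value bit is 1 and is itself defined."""
    return (~t | q) & MASK[size]
```
]]></programlisting>
<para>In this model <computeroutput>left(0b0100, 1)</computeroutput>
gives <computeroutput>0xFC</computeroutput>, and the widening
functions are literally zero- and sign-extension applied to the V
bits, which is the coincidence remarked on above.</para>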
				1515
				1516	<para>That's all the tag ops. If you stare at this long enough,
				1517	and then run Valgrind and stare at the pre- and post-instrumented
				1518	ucode, it should be fairly obvious how the instrumentation
				1519	machinery hangs together.</para>
				1520
				1521	<para>One point, if you do this: in order to make it easy to
				1522	differentiate <computeroutput>TempReg</computeroutput>s carrying
				1523	values from <computeroutput>TempReg</computeroutput>s carrying V
				1524	bit vectors, Valgrind prints the former as (for example)
				1525	<computeroutput>t28</computeroutput> and the latter as
				1526	<computeroutput>q28</computeroutput>; the fact that they carry
				1527	the same number serves to indicate their relationship. This is
				1528	purely for the convenience of the human reader; the register
				1529	allocator and code generator don't regard them as
				1530	different.</para>
				1531
				1532	</sect2>
				1533
				1534
				1535
				1536	<sect2 id="mc-manual.trans" xreflabel="Translation into UCode">
				1537	<title>Translation into UCode</title>
				1538
				1539	<para><computeroutput>VG_(disBB)</computeroutput> allocates a new
				1540	<computeroutput>UCodeBlock</computeroutput> and then uses
				1541	<computeroutput>disInstr</computeroutput> to translate x86
				1542	instructions one at a time into UCode, dumping the result in the
				1543	<computeroutput>UCodeBlock</computeroutput>. This goes on until
				1544	a control-flow transfer instruction is encountered.</para>
				1545
				1546	<para>Despite the large size of
				1547	<filename>vg_to_ucode.c</filename>, this translation is really
				1548	very simple. Each x86 instruction is translated entirely
				1549	independently of its neighbours, merrily allocating new
				1550	<computeroutput>TempReg</computeroutput>s as it goes. The idea
				1551	is to have a simple translator -- in reality, no more than a
				1552	macro-expander -- and the resulting bad UCode translation is
				1553	cleaned up by the UCode optimisation phase which follows. To
				1554	give you an idea of some x86 instructions and their translations
				1555	(this is a complete basic block, as Valgrind sees it):</para>
				1556	<programlisting><![CDATA[
				1557	0x40435A50: incl %edx
				1558	0: GETL %EDX, t0
				1559	1: INCL t0 (-wOSZAP)
				1560	2: PUTL t0, %EDX
				1561
				1562	0x40435A51: movsbl (%edx),%eax
				1563	3: GETL %EDX, t2
				1564	4: LDB (t2), t2
				1565	5: WIDENL_Bs t2
				1566	6: PUTL t2, %EAX
				1567
				1568	0x40435A54: testb $0x20, 1(%ecx,%eax,2)
				1569	7: GETL %EAX, t6
				1570	8: GETL %ECX, t8
				1571	9: LEA2L 1(t8,t6,2), t4
				1572	10: LDB (t4), t10
				1573	11: MOVB $0x20, t12
				1574	12: ANDB t12, t10 (-wOSZACP)
				1575	13: INCEIPo $9
				1576
				1577	0x40435A59: jnz-8 0x40435A50
				1578	14: Jnzo $0x40435A50 (-rOSZACP)
				1579	15: JMPo $0x40435A5B]]></programlisting>
				1580
				1581	<para>Notice how the block always ends with an unconditional jump
				1582	to the next block. This is a bit unnecessary, but makes many
				1583	things simpler.</para>
				1584
				1585	<para>Most x86 instructions turn into sequences of
				1586	<computeroutput>GET</computeroutput>,
				1587	<computeroutput>PUT</computeroutput>,
				1588	<computeroutput>LEA1</computeroutput>,
				1589	<computeroutput>LEA2</computeroutput>,
				1590	<computeroutput>LOAD</computeroutput> and
				1591	<computeroutput>STORE</computeroutput>. Some complicated ones
				1592	however rely on calling helper bits of code in
				1593	<filename>vg_helpers.S</filename>. The ucode instructions
				1594	<computeroutput>PUSH</computeroutput>,
				1595	<computeroutput>POP</computeroutput>,
				1596	<computeroutput>CALL</computeroutput>,
				1597	<computeroutput>CALLM_S</computeroutput> and
				1598	<computeroutput>CALLM_E</computeroutput> support this. The
				1599	calling convention is somewhat ad-hoc and is not the C calling
				1600	convention. The helper routines must save all integer registers,
				1601	and the flags, that they use. Args are passed on the stack
				1602	underneath the return address, as usual, and if result(s) are to
				1603	be returned, it (they) are either placed in dummy arg slots
				1604	created by the ucode <computeroutput>PUSH</computeroutput>
				1605	sequence, or just overwrite the incoming args.</para>
				1606
				1607	<para>In order that the instrumentation mechanism can handle
				1608	calls to these helpers,
				1609	<computeroutput>VG_(saneUCodeBlock)</computeroutput> enforces the
				1610	following restrictions on calls to helpers:</para>
				1611
				1612	<itemizedlist>
				1613
				1614	<listitem>
				1615	<para>Each <computeroutput>CALL</computeroutput> uinstr must
				1616	be bracketed by a preceding
				1617	<computeroutput>CALLM_S</computeroutput> marker (dummy
				1618	uinstr) and a trailing
				1619	<computeroutput>CALLM_E</computeroutput> marker. These
				1620	markers are used by the instrumentation mechanism later to
				1621	establish the boundaries of the
				1622	<computeroutput>PUSH</computeroutput>,
				1623	<computeroutput>POP</computeroutput> and
				1624	<computeroutput>CLEAR</computeroutput> sequences for the
				1625	call.</para>
				1626	</listitem>
				1627
				1628	<listitem>
				1629	<para><computeroutput>PUSH</computeroutput>,
				1630	<computeroutput>POP</computeroutput> and
				1631	<computeroutput>CLEAR</computeroutput> may only appear inside
				1632	sections bracketed by
				1633	<computeroutput>CALLM_S</computeroutput> and
				1634	<computeroutput>CALLM_E</computeroutput>, and nowhere else.</para>
				1635	</listitem>
				1636
				1637	<listitem>
				1638	<para>In any such bracketed section, no two
				1639	<computeroutput>PUSH</computeroutput> insns may push the same
				1640	<computeroutput>TempReg</computeroutput>. Dually, no two
				1641	<computeroutput>POP</computeroutput>s may pop the same
				1642	<computeroutput>TempReg</computeroutput>.</para>
				1643	</listitem>
				1644
				1645	<listitem>
				1646	<para>Finally, although this is not checked, args should be
				1647	removed from the stack with
				1648	<computeroutput>CLEAR</computeroutput>, rather than
				1649	<computeroutput>POP</computeroutput>s into a
				1650	<computeroutput>TempReg</computeroutput> which is not
				1651	subsequently used. This is because the instrumentation
				1652	mechanism assumes that all values
				1653	<computeroutput>POP</computeroutput>ped from the stack are
				1654	actually used.</para>
				1655	</listitem>
				1656
				1657	</itemizedlist>
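<para>The three checked restrictions can be sketched as a small
validator. This is only an illustration of the rules as stated
above, not the code of
<computeroutput>VG_(saneUCodeBlock)</computeroutput>: the
representation of a block as a list of (opcode, TempReg) pairs and
the function name are assumptions, as is the decision to reject
nested brackets.</para>
<programlisting><![CDATA[
```python
def sane_helper_calls(block):
    """Return True if a list of (opcode, tempreg) pairs obeys the checked rules."""
    inside = False
    pushed, popped = set(), set()
    for op, treg in block:
        if op == "CALLM_S":
            if inside:
                return False          # assumption: brackets do not nest
            inside, pushed, popped = True, set(), set()
        elif op == "CALLM_E":
            if not inside:
                return False          # closing marker without an opening one
            inside = False
        elif op in ("CALL", "PUSH", "POP", "CLEAR"):
            if not inside:
                return False          # rules 1 and 2: only inside a bracket
            if op == "PUSH":
                if treg in pushed:
                    return False      # rule 3: same TempReg pushed twice
                pushed.add(treg)
            elif op == "POP":
                if treg in popped:
                    return False      # rule 3: same TempReg popped twice
                popped.add(treg)
    return not inside                 # every CALLM_S needs its CALLM_E
```
]]></programlisting>
<para>The fourth restriction (prefer
<computeroutput>CLEAR</computeroutput> to an unused
<computeroutput>POP</computeroutput>) is not checked here either,
matching the text above.</para>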
				1658
				1659	<para>Some of the translations may appear to have redundant
				1660	<computeroutput>TempReg</computeroutput>-to-<computeroutput>TempReg</computeroutput>
				1661	moves. This helps the next phase, UCode optimisation, to
				1662	generate better code.</para>
				1663
				1664	</sect2>
				1665
				1666
				1667
				1668	<sect2 id="mc-tech-docs.optim" xreflabel="UCode optimisation">
				1669	<title>UCode optimisation</title>
				1670
				1671	<para>UCode is then subjected to an improvement pass
				1672	(<computeroutput>vg_improve()</computeroutput>), which blurs the
				1673	boundaries between the translations of the original x86
				1674	instructions. It's pretty straightforward. Three
				1675	transformations are done:</para>
				1676
				1677	<itemizedlist>
				1678
				1679	<listitem>
				1680	<para>Redundant <computeroutput>GET</computeroutput>
				1681	elimination. Actually, more general than that -- eliminates
				1682	redundant fetches of ArchRegs. In our running example,
				1683	uinstr 3 <computeroutput>GET</computeroutput>s
				1684	<computeroutput>%EDX</computeroutput> into
				1685	<computeroutput>t2</computeroutput> despite the fact that, by
				1686	looking at the previous uinstr, it is already in
				1687	<computeroutput>t0</computeroutput>. The
				1688	<computeroutput>GET</computeroutput> is therefore removed,
				1689	and <computeroutput>t2</computeroutput> renamed to
				1690	<computeroutput>t0</computeroutput>. Assuming
				1691	<computeroutput>t0</computeroutput> is allocated to a host
				1692	register, it means the simulated
				1693	<computeroutput>%EDX</computeroutput> will exist in a host
				1694	CPU register for more than one simulated x86 instruction,
				1695	which seems to me to be a highly desirable property.</para>
				1696
				1697	<para>There is some mucking around to do with subregisters;
				1698	<computeroutput>%AL</computeroutput> vs
				1699	<computeroutput>%AH</computeroutput> vs
				1700	<computeroutput>%AX</computeroutput> vs
				1701	<computeroutput>%EAX</computeroutput> etc. I can't remember
				1702	how it works, but in general we are very conservative, and
				1703	these tend to invalidate the caching.</para>
				1704	</listitem>
				1705
				1706	<listitem>
				1707	<para>Redundant <computeroutput>PUT</computeroutput>
				1708	elimination. This annuls
				1709	<computeroutput>PUT</computeroutput>s of values back to
				1710	simulated CPU registers if a later
				1711	<computeroutput>PUT</computeroutput> would overwrite the
				1712	earlier <computeroutput>PUT</computeroutput> value, and there
				1713	are no intervening reads of the simulated register
				1714	(<computeroutput>ArchReg</computeroutput>).</para>
				1715
				1716	<para>As before, we are paranoid when faced with subregister
				1717	references. Also, <computeroutput>PUT</computeroutput>s of
				1718	<computeroutput>%ESP</computeroutput> are never annulled,
				1719	because it is vital the instrumenter always has an up-to-date
				1720	<computeroutput>%ESP</computeroutput> value available:
				1721	<computeroutput>%ESP</computeroutput> changes affect
				1722	addressability of the memory around the simulated stack
				1723	pointer.</para>
				1724
				1725	<para>The implication of the above paragraph is that the
				1726	simulated machine's registers are only lazily updated once
				1727	the above two optimisation phases have run, with the
				1728	exception of <computeroutput>%ESP</computeroutput>.
				1729	<computeroutput>TempReg</computeroutput>s go dead at the end
				1730	of every basic block, from which it is inferrable that any
				1731	<computeroutput>TempReg</computeroutput> caching a simulated
				1732	CPU reg is flushed (back into the relevant
				1733	<computeroutput>VG_(baseBlock)</computeroutput> slot) at the
				1734	end of every basic block. The further implication is that
				1735	the simulated registers are only up-to-date in between
				1736	basic blocks, and not at arbitrary points inside basic
				1737	blocks. And the consequence of that is that we can only
				1738	deliver signals to the client in between basic blocks. None
				1739	of this seems any problem in practice.</para>
				1740	</listitem>
				1741
				1742	<listitem>
				1743	<para>Finally there is a simple def-use thing for condition
				1744	codes. If an earlier uinstr writes the condition codes, and
				1745	the next uinstr along which actually cares about the condition
				1746	codes writes the same or larger set of them, but does not
				1747	read any, the earlier uinstr is marked as not writing any
				1748	condition codes. This saves a lot of redundant cond-code
				1749	saving and restoring.</para>
				1750	</listitem>
				1751
				1752	</itemizedlist>
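<para>To make the first transformation concrete, here is a toy
version of redundant <computeroutput>GET</computeroutput>
elimination. It is a sketch under several assumptions and is not
the code of <computeroutput>vg_improve()</computeroutput>: uinstrs
are modelled as (opcode, reads, writes) tuples, subregisters are
ignored entirely, each <computeroutput>TempReg</computeroutput> is
assumed to be the target of at most one
<computeroutput>GET</computeroutput>, and the check that the
caching <computeroutput>TempReg</computeroutput> is not mentioned
later under its own name is my own conservative safeguard.</para>
<programlisting><![CDATA[
```python
def eliminate_redundant_gets(block):
    """block: list of (opcode, reads, writes); ArchRegs start with '%'.
    Returns (new_block, log)."""
    cache = {}    # ArchReg -> TempReg currently holding the same value
    rename = {}   # deleted GET destination -> TempReg that replaces it
    out, log = [], []
    for i, (op, reads, writes) in enumerate(block):
        reads = tuple(rename.get(r, r) for r in reads)
        writes = tuple(rename.get(w, w) for w in writes)
        if op == "GET" and reads[0] in cache:
            holder = cache[reads[0]]
            aliases = {holder} | {t for t, r in rename.items() if r == holder}
            later = [x for _, rs, ws in block[i + 1:] for x in rs + ws]
            if not aliases.intersection(later):
                rename[writes[0]] = holder
                log.append("at %d: delete GET, rename %s to %s"
                           % (i, writes[0], holder))
                continue
        for w in writes:          # a written TempReg no longer mirrors an ArchReg
            for arch in [a for a, t in cache.items() if t == w]:
                del cache[arch]
        if op == "GET":
            cache[reads[0]] = writes[0]
        elif op == "PUT":
            cache[writes[0]] = reads[0]
        out.append((op, reads, writes))
    return out, log

# The first ten uinstrs of the running example, in the toy representation.
BLOCK = [
    ("GET",   ("%EDX",),    ("t0",)),
    ("INC",   ("t0",),      ("t0",)),
    ("PUT",   ("t0",),      ("%EDX",)),
    ("GET",   ("%EDX",),    ("t2",)),
    ("LDB",   ("t2",),      ("t2",)),
    ("WIDEN", ("t2",),      ("t2",)),
    ("PUT",   ("t2",),      ("%EAX",)),
    ("GET",   ("%EAX",),    ("t6",)),
    ("GET",   ("%ECX",),    ("t8",)),
    ("LEA2",  ("t8", "t6"), ("t4",)),
]
improved, log = eliminate_redundant_gets(BLOCK)
```
]]></programlisting>
<para>On this input the sketch deletes the
<computeroutput>GET</computeroutput>s at positions 3 and 7 and
renames <computeroutput>t2</computeroutput> and
<computeroutput>t6</computeroutput> to
<computeroutput>t0</computeroutput>.</para>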
				1753
				1754	<para>The effect of these transformations on our short block is
				1755	rather unexciting, and shown below. On longer basic blocks they
				1756	can dramatically improve code quality.</para>
				1757
				1758	<programlisting><![CDATA[
				1759	at 3: delete GET, rename t2 to t0 in (4 .. 6)
				1760	at 7: delete GET, rename t6 to t0 in (8 .. 9)
				1761	at 1: annul flag write OSZAP due to later OSZACP
				1762
				1763	Improved code:
				1764	0: GETL %EDX, t0
				1765	1: INCL t0
				1766	2: PUTL t0, %EDX
				1767	4: LDB (t0), t0
				1768	5: WIDENL_Bs t0
				1769	6: PUTL t0, %EAX
				1770	8: GETL %ECX, t8
				1771	9: LEA2L 1(t8,t0,2), t4
				1772	10: LDB (t4), t10
				1773	11: MOVB $0x20, t12
				1774	12: ANDB t12, t10 (-wOSZACP)
				1775	13: INCEIPo $9
				1776	14: Jnzo $0x40435A50 (-rOSZACP)
				1777	15: JMPo $0x40435A5B]]></programlisting>
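<para>The condition-code transformation reported in the log above
can be sketched in the same toy style. Again this is an
illustration only: the (text, flags read, flags written)
representation and the function name are assumptions, not
Valgrind's data structures.</para>
<programlisting><![CDATA[
```python
def annul_dead_flag_writes(block):
    """block: list of (text, flags_read, flags_written), flags as strings
    such as 'OSZACP'. Returns (new_block, log)."""
    out = list(block)
    log = []
    for i, (text, rd, wr) in enumerate(block):
        if not wr:
            continue
        # Find the next uinstr that reads or writes any condition codes.
        nxt = next(((r, w) for _, r, w in block[i + 1:] if r or w), None)
        # Annul only if it reads nothing and overwrites everything we wrote.
        if nxt and not nxt[0] and set(wr) <= set(nxt[1]):
            out[i] = (text, rd, "")
            log.append("at %d: annul flag write %s due to later %s"
                       % (i, wr, nxt[1]))
    return out, log

FLAG_BLOCK = [
    ("GETL %EDX, t0",    "",       ""),
    ("INCL t0",          "",       "OSZAP"),
    ("PUTL t0, %EDX",    "",       ""),
    ("ANDB t12, t10",    "",       "OSZACP"),
    ("Jnzo $0x40435A50", "OSZACP", ""),
]
flag_out, flag_log = annul_dead_flag_writes(FLAG_BLOCK)
```
]]></programlisting>
<para>The <computeroutput>ANDB</computeroutput> keeps its flag
write because the next uinstr that cares, the conditional jump,
reads the flags.</para>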
				1778
				1779	</sect2>
				1780
				1781
				1782
				1783	<sect2 id="mc-tech-docs.instrum" xreflabel="UCode instrumentation">
				1784	<title>UCode instrumentation</title>
				1785
				1786	<para>Once you understand the meaning of the instrumentation
				1787	uinstrs, discussed in detail above, the instrumentation scheme is
				1788	fairly straightforward. Each uinstr is instrumented in
				1789	isolation, and the instrumentation uinstrs are placed before the
				1790	original uinstr. Our running example continues below. I have
				1791	placed a blank line after every original ucode, to make it easier
				1792	to see which instrumentation uinstrs correspond to which
				1793	originals.</para>
				1794
				1795	<para>As mentioned somewhere above,
				1796	<computeroutput>TempReg</computeroutput>s carrying values have
				1797	names like <computeroutput>t28</computeroutput>, and each one has
				1798	a shadow carrying its V bits, with names like
				1799	<computeroutput>q28</computeroutput>. This pairing aids in
				1800	reading instrumented ucode.</para>
				1801
				1802	<para>One decision about all this is where to have "observation
				1803	points", that is, where to check that V bits are valid. I use a
				1804	minimalistic scheme, only checking where a failure of validity
				1805	could cause the original program to (seg)fault. So the use of
				1806	values as memory addresses causes a check, as do conditional
				1807	jumps (these cause a check on the definedness of the condition
				1808	codes). And arguments <computeroutput>PUSH</computeroutput>ed
				1809	for helper calls are checked, hence the weird restrictions on
				1810	helper call preambles described above.</para>
				1811
				1812	<para>Another decision is that once a value is tested, it is
				1813	thereafter regarded as defined, so that we do not emit multiple
				1814	undefined-value errors for the same undefined value. That means
				1815	that <computeroutput>TESTV</computeroutput> uinstrs are always
				1816	followed by <computeroutput>SETV</computeroutput> on the same
				1817	(shadow) <computeroutput>TempReg</computeroutput>s. Most of
				1818	these <computeroutput>SETV</computeroutput>s are redundant and
				1819	are removed by the post-instrumentation cleanup phase.</para>
				1820
				1821	<para>The instrumentation for calling helper functions deserves
				1822	further comment. The definedness of results from a helper is
				1823	modelled using just one V bit. So, in short, we do pessimising
				1824	casts of the definedness of all the args, down to a single bit,
				1825	and then <computeroutput>UifU</computeroutput> these bits
				1826	together. So this single V bit will say "undefined" if any part
				1827	of any arg is undefined. This V bit is then pessimally cast back
				1828	up to the result(s) sizes, as needed. If, by seeing that all the
				1829	args are got rid of with <computeroutput>CLEAR</computeroutput>
				1830	and none with <computeroutput>POP</computeroutput>, Valgrind sees
				1831	that the result of the call is not actually used, it immediately
				1832	examines the result V bit with a
				1833	<computeroutput>TESTV</computeroutput> --
				1834	<computeroutput>SETV</computeroutput> pair. If it did not do
				1835	this, there would be no observation point to detect that some
				1836	of the args to the helper were undefined. Of course, if the
				1837	helper's results are indeed used, we don't do this, since the
				1838	result usage will presumably cause the result definedness to be
				1839	checked at some suitable future point.</para>
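<para>The helper-call scheme just described can be sketched as
follows. This is a model of the idea only; the function names, the
argument encoding and the error text are assumptions of the
example.</para>
<programlisting><![CDATA[
```python
MASK = {0: 0x1, 1: 0xFF, 2: 0xFFFF, 4: 0xFFFFFFFF}  # size -> V-bit mask

def helper_result_vbits(arg_vbits, result_sizes):
    """arg_vbits: list of (vbits, size), one per PUSHed argument.
    Returns (V-bit vector for each result, the combined single V bit)."""
    combined = 0
    for v, size in arg_vbits:
        bit = 1 if (v & MASK[size]) else 0   # pessimising cast down to 1 bit
        combined |= bit                      # UifU on single bits
    # Pessimising cast of the single bit back up to each result size.
    results = [MASK[s] if combined else 0 for s in result_sizes]
    return results, combined

def instrument_call(arg_vbits, result_sizes, errors):
    """If no result is POPped, test the combined bit immediately."""
    results, bit = helper_result_vbits(arg_vbits, result_sizes)
    if not result_sizes and bit:
        errors.append("helper call with undefined argument")
    return results
```
]]></programlisting>
<para>A single undefined bit in any argument therefore makes every
bit of every result undefined.</para>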
				1840
				1841	<para>In general Valgrind tries to track definedness on a
				1842	bit-for-bit basis, but as the above para shows, for calls to
				1843	helpers we throw in the towel and approximate down to a single
				1844	bit. This is because it's too complex and difficult to track
				1845	bit-level definedness through complex ops such as integer
				1846	multiply and divide, and in any case there are no reasonable code
				1847	fragments which attempt to (eg) multiply two partially-defined
				1848	values and end up with something meaningful, so there seems
				1849	little point in modelling multiplies, divides, etc, in that level
				1850	of detail.</para>
				1851
				1852	<para>Integer loads and stores are instrumented with firstly a
				1853	test of the definedness of the address, followed by a
				1854	<computeroutput>LOADV</computeroutput> or
				1855	<computeroutput>STOREV</computeroutput> respectively. These turn
				1856	into calls to (for example)
				1857	<computeroutput>VG_(helperc_LOADV4)</computeroutput>. These
				1858	helpers do two things: they perform an address-valid check, and
				1859	they load or store V bits from/to the relevant address in the
				1860	(simulated V-bit) memory.</para>
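<para>The two jobs of these helpers can be modelled with a toy
shadow memory. Everything in the sketch below is an assumption
made for illustration (the class, its dictionaries, the defaults
for unknown addresses, and what is returned after a complaint); it
does not describe the real layout of Valgrind's shadow
memory.</para>
<programlisting><![CDATA[
```python
class ShadowMemory:
    """Toy shadow memory: one A bit and eight V bits per byte."""

    def __init__(self):
        self.abits = {}   # address -> A bit; missing means 1 (invalid address)
        self.vbits = {}   # address -> 8 V bits; missing means 0xFF (undefined)
        self.errors = []

    def make_addressable(self, addr, length):
        for i in range(length):
            self.abits[addr + i] = 0

    def _check(self, addr, size, what):
        if any(self.abits.get(addr + i, 1) for i in range(size)):
            self.errors.append("invalid %s of size %d at 0x%X" % (what, size, addr))

    def loadv(self, addr, size):
        """Address-validity check, then fetch the V bits (little-endian)."""
        self._check(addr, size, "read")
        v = 0
        for i in range(size):
            v |= self.vbits.get(addr + i, 0xFF) << (8 * i)
        return v

    def storev(self, addr, size, v):
        """Address-validity check, then write the V bits."""
        self._check(addr, size, "write")
        for i in range(size):
            self.vbits[addr + i] = (v >> (8 * i)) & 0xFF
```
]]></programlisting>
<para>Note that freshly addressable memory is valid to read but
its V bits still say "undefined" until something is stored
there.</para>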
				1861
				1862	<para>FPU loads and stores are different. As above the
				1863	definedness of the address is first tested. However, the helper
				1864	routine for FPU loads
				1865	(<computeroutput>VGM_(fpu_read_check)</computeroutput>) emits an
				1866	error if either the address is invalid or the referenced area
				1867	contains undefined values. It has to do this because we do not
				1868	simulate the FPU at all, and so cannot track definedness of
				1869	values loaded into it from memory, so we have to check them as
				1870	soon as they are loaded into the FPU, ie, at this point. We
				1871	notionally assume that everything in the FPU is defined.</para>
				1872
				1873	<para>It follows therefore that FPU writes first check the
				1874	definedness of the address, then the validity of the address, and
				1875	finally mark the written bytes as well-defined.</para>
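<para>By contrast, the FPU scheme checks eagerly. A sketch,
assuming the same kind of per-byte dictionaries for A and V bits
as a stand-in for the real shadow memory (the Python function
names merely echo the helper named above and are not its
implementation):</para>
<programlisting><![CDATA[
```python
def fpu_read_check(abits, vbits, addr, size, errors):
    """Complain at once if the area is unaddressable or holds undefined bytes."""
    for i in range(size):
        if abits.get(addr + i, 1):
            errors.append("invalid FPU read at 0x%X" % (addr + i))
            return
        if vbits.get(addr + i, 0xFF):
            errors.append("FPU read of undefined byte at 0x%X" % (addr + i))
            return

def fpu_write_check(abits, vbits, addr, size, errors):
    """Check addressability, then mark the written bytes as well-defined."""
    for i in range(size):
        if abits.get(addr + i, 1):
            errors.append("invalid FPU write at 0x%X" % (addr + i))
            return
    for i in range(size):
        vbits[addr + i] = 0   # everything in the FPU is assumed defined
```
]]></programlisting>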
				1876
				1877	<para>If anyone is inspired to extend Valgrind to MMX/SSE insns,
				1878	I suggest you use the same trick. It works provided that the
				1879	FPU/MMX unit is not used merely as a conduit to copy partially
				1880	undefined data from one place in memory to another.
				1881	Unfortunately the integer CPU is used like that (when copying C
				1882	structs with holes, for example) and this is the cause of much of
				1883	the elaborateness of the instrumentation here described.</para>
				1884
				1885	<para><computeroutput>vg_instrument()</computeroutput> in
				1886	<filename>vg_translate.c</filename> actually does the
				1887	instrumentation. There are comments explaining how each uinstr
				1888	is handled, so we do not repeat that here. As explained already,
				1889	it is bit-accurate, except for calls to helper functions.
				1890	Unfortunately the x86 insns
				1891	<computeroutput>bt/bts/btc/btr</computeroutput> are done by
				1892	helper fns, so bit-level accuracy is lost there. This should be
				1893	fixed by doing them inline; it will probably require adding a
				1894	couple of new uinstrs. Also, left and right rotates through the
				1895	carry flag (x86 <computeroutput>rcl</computeroutput> and
				1896	<computeroutput>rcr</computeroutput>) are approximated via a
				1897	single V bit; so far this has not caused anyone to complain. The
				1898	non-carry rotates, <computeroutput>rol</computeroutput> and
				1899	<computeroutput>ror</computeroutput>, are much more common and
				1900	are done exactly. Re-visiting the instrumentation for AND and
				1901	OR, they seem rather verbose, and I wonder if it could be done
				1902	more concisely now.</para>
				1903
				1904	<para>The lowercase <computeroutput>o</computeroutput> on many of
				1905	the uopcodes in the running example indicates that the size field
				1906	is zero, usually meaning a single-bit operation.</para>
				1907
				1908	<para>Anyroads, the post-instrumented version of our running
				1909	example looks like this:</para>
				1910
				1911	<programlisting><![CDATA[
				1912	Instrumented code:
				1913	0: GETVL %EDX, q0
				1914	1: GETL %EDX, t0
				1915
				1916	2: TAG1o q0 = Left4 ( q0 )
				1917	3: INCL t0
				1918
				1919	4: PUTVL q0, %EDX
				1920	5: PUTL t0, %EDX
				1921
				1922	6: TESTVL q0
				1923	7: SETVL q0
				1924	8: LOADVB (t0), q0
				1925	9: LDB (t0), t0
				1926
				1927	10: TAG1o q0 = SWiden14 ( q0 )
				1928	11: WIDENL_Bs t0
				1929
				1930	12: PUTVL q0, %EAX
				1931	13: PUTL t0, %EAX
				1932
				1933	14: GETVL %ECX, q8
				1934	15: GETL %ECX, t8
				1935
				1936	16: MOVL q0, q4
				1937	17: SHLL $0x1, q4
				1938	18: TAG2o q4 = UifU4 ( q8, q4 )
				1939	19: TAG1o q4 = Left4 ( q4 )
				1940	20: LEA2L 1(t8,t0,2), t4
				1941
				1942	21: TESTVL q4
				1943	22: SETVL q4
				1944	23: LOADVB (t4), q10
				1945	24: LDB (t4), t10
				1946
				1947	25: SETVB q12
				1948	26: MOVB $0x20, t12
				1949
				1950	27: MOVL q10, q14
				1951	28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 )
				1952	29: TAG2o q10 = UifU1 ( q12, q10 )
				1953	30: TAG2o q10 = DifD1 ( q14, q10 )
				1954	31: MOVL q12, q14
				1955	32: TAG2o q14 = ImproveAND1_TQ ( t12, q14 )
				1956	33: TAG2o q10 = DifD1 ( q14, q10 )
				1957	34: MOVL q10, q16
				1958	35: TAG1o q16 = PCast10 ( q16 )
				1959	36: PUTVFo q16
				1960	37: ANDB t12, t10 (-wOSZACP)
				1961
				1962	38: INCEIPo $9
				1963
				1964	39: GETVFo q18
				1965	40: TESTVo q18
				1966	41: SETVo q18
				1967	42: Jnzo $0x40435A50 (-rOSZACP)
				1968
				1969	43: JMPo $0x40435A5B]]></programlisting>
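<para>It may help to replay the shadow computation for the
<computeroutput>ANDB</computeroutput> (uinstrs 27 to 35 above) on
ordinary integers. The function below is a hand translation of
those uinstrs into Python, written for this example; its name and
calling convention are assumptions.</para>
<programlisting><![CDATA[
```python
def and_byte_vbits(t10, q10, t12, q12):
    """V bits for the 8-bit result of t12 AND t10, plus the single flags V bit.
    t* are values, q* their V bits (1 = undefined)."""
    m = 0xFF
    q14 = (t10 | q10) & m        # 27-28: ImproveAND1_TQ(t10, q10)
    res = (q12 | q10) & m        # 29:    UifU1
    res &= q14                   # 30:    DifD1
    q14 = (t12 | q12) & m        # 31-32: ImproveAND1_TQ(t12, q12)
    res &= q14                   # 33:    DifD1
    flags = 1 if res else 0      # 34-35: PCast10
    return res, flags
```
]]></programlisting>
<para>With the literal <computeroutput>0x20</computeroutput>
fully defined, only bit 5 of the loaded byte can make the result
undefined:
<computeroutput>and_byte_vbits(0x00, 0xFF, 0x20, 0x00)</computeroutput>
returns <computeroutput>(0x20, 1)</computeroutput>, whereas a
loaded byte whose only defined bit is bit 5 gives a fully defined
result.</para>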
				1970
				1971	</sect2>
				1972
				1973
				1974
				1975	<sect2 id="mc-tech-docs.cleanup"
				1976	xreflabel="UCode post-instrumentation cleanup">
				1977	<title>UCode post-instrumentation cleanup</title>
				1978
				1979	<para>This pass, coordinated by
				1980	<computeroutput>vg_cleanup()</computeroutput>, removes redundant
				1981	definedness computation created by the simplistic instrumentation
				1982	pass. It consists of two passes,
				1983	<computeroutput>vg_propagate_definedness()</computeroutput>
				1984	followed by
				1985	<computeroutput>vg_delete_redundant_SETVs()</computeroutput>.</para>
				1986
				1987	<para><computeroutput>vg_propagate_definedness()</computeroutput>
				1988	is a simple constant-propagation and constant-folding pass. It
				1989	tries to determine which
				1990	<computeroutput>TempReg</computeroutput>s containing V bits will
				1991	always indicate "fully defined", and it propagates this
				1992	information as far as it can, and folds out as many operations as
				1993	possible. For example, the instrumentation for an ADD of a
				1994	literal to a variable quantity will be reduced down so that the
				1995	definedness of the result is simply the definedness of the
				1996	variable quantity, since the literal is by definition fully
				1997	defined.</para>
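<para>To make the idea concrete, here is a minimal sketch of such a
pass over a toy instruction set. It is not the code in Valgrind's
sources: the <computeroutput>Toy*</computeroutput> types, the opcode
names and <computeroutput>toy_propagate_definedness</computeroutput>
are all invented for illustration, and only the "delete a UifU whose
first argument is known defined" rule is modelled.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy opcodes, standing in for the real UCode set. */
typedef enum { TOY_SETV, TOY_LOADV, TOY_UIFU, TOY_NOP } ToyOp;

typedef struct {
    ToyOp op;
    int   src;  /* shadow temp read by TOY_UIFU; unused otherwise */
    int   dst;  /* shadow temp written (TOY_UIFU also reads it) */
} ToyInstr;

enum { TOY_MAX_TEMPS = 32 };

/* One forward pass over a straight-line block.  Tracks which shadow
   temps are known to be fully defined and deletes any UifU whose
   first argument is known defined.  Returns the number deleted. */
static int toy_propagate_definedness(ToyInstr *code, int n)
{
    bool defd[TOY_MAX_TEMPS] = { false };
    int  deleted = 0;

    for (int i = 0; i < n; i++) {
        ToyInstr *u = &code[i];
        assert(u->dst >= 0 && u->dst < TOY_MAX_TEMPS);
        switch (u->op) {
        case TOY_SETV:   /* dst := "fully defined" */
            defd[u->dst] = true;
            break;
        case TOY_LOADV:  /* dst := V bits from memory, definedness unknown */
            defd[u->dst] = false;
            break;
        case TOY_UIFU:   /* dst := UifU(src, dst) */
            assert(u->src >= 0 && u->src < TOY_MAX_TEMPS);
            if (defd[u->src]) {
                u->op = TOY_NOP;       /* dst is left unchanged */
                deleted++;
            } else {
                defd[u->dst] = false;  /* result is no more defined than src */
            }
            break;
        case TOY_NOP:
            break;
        }
    }
    return deleted;
}
```
]]></programlisting>
<para>Fed the toy equivalent of a
<computeroutput>SETV</computeroutput> on one shadow temp, a
<computeroutput>LOADV</computeroutput> into another, and a
<computeroutput>UifU</computeroutput> of the two, the sketch deletes
the <computeroutput>UifU</computeroutput> and leaves the rest
alone.</para>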
				1998
				1999	<para><computeroutput>vg_delete_redundant_SETVs()</computeroutput>
				2000	removes <computeroutput>SETV</computeroutput>s on shadow
				2001	<computeroutput>TempReg</computeroutput>s for which the next
				2002	action is a write. I don't think there's anything else worth
				2003	saying about this; it is simple. Read the sources for
				2004	details.</para>
				2005
				2006	<para>So the cleaned-up running example looks like this. As
				2007	above, I have inserted line breaks after every original
				2008	(non-instrumentation) uinstr to aid readability. As with
				2009	straightforward ucode optimisation, the results in this block are
				2010	undramatic because it is so short; longer blocks benefit more
				2011	because they have more redundancy which gets eliminated.</para>
				2012
				2013	<programlisting><![CDATA[
				2014	at 29: delete UifU1 due to defd arg1
				2015	at 32: change ImproveAND1_TQ to MOV due to defd arg2
				2016	at 41: delete SETV
				2017	at 31: delete MOV
				2018	at 25: delete SETV
				2019	at 22: delete SETV
				2020	at 7: delete SETV
				2021
				2022	0: GETVL %EDX, q0
				2023	1: GETL %EDX, t0
				2024
				2025	2: TAG1o q0 = Left4 ( q0 )
				2026	3: INCL t0
				2027
				2028	4: PUTVL q0, %EDX
				2029	5: PUTL t0, %EDX
				2030
				2031	6: TESTVL q0
				2032	8: LOADVB (t0), q0
				2033	9: LDB (t0), t0
				2034
				2035	10: TAG1o q0 = SWiden14 ( q0 )
				2036	11: WIDENL_Bs t0
				2037
				2038	12: PUTVL q0, %EAX
				2039	13: PUTL t0, %EAX
				2040
				2041	14: GETVL %ECX, q8
				2042	15: GETL %ECX, t8
				2043
				2044	16: MOVL q0, q4
				2045	17: SHLL $0x1, q4
				2046	18: TAG2o q4 = UifU4 ( q8, q4 )
				2047	19: TAG1o q4 = Left4 ( q4 )
				2048	20: LEA2L 1(t8,t0,2), t4
				2049
				2050	21: TESTVL q4
				2051	23: LOADVB (t4), q10
				2052	24: LDB (t4), t10
				2053
				2054	26: MOVB $0x20, t12
				2055
				2056	27: MOVL q10, q14
				2057	28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 )
				2058	30: TAG2o q10 = DifD1 ( q14, q10 )
				2059	32: MOVL t12, q14
				2060	33: TAG2o q10 = DifD1 ( q14, q10 )
				2061	34: MOVL q10, q16
				2062	35: TAG1o q16 = PCast10 ( q16 )
				2063	36: PUTVFo q16
				2064	37: ANDB t12, t10 (-wOSZACP)
				2065
				2066	38: INCEIPo $9
				2067	39: GETVFo q18
				2068	40: TESTVo q18
				2069	42: Jnzo $0x40435A50 (-rOSZACP)
				2070
				2071	43: JMPo $0x40435A5B]]></programlisting>
				2072
				2073	</sect2>
				2074
				2075
				2076
				2077	<sect2 id="mc-tech-docs.transfrom" xreflabel="Translation from UCode">
				2078	<title>Translation from UCode</title>
				2079
				2080	<para>This is all very simple, even though
				2081	<filename>vg_from_ucode.c</filename> is a big file.
				2082	Position-independent x86 code is generated into a dynamically
				2083	allocated array <computeroutput>emitted_code</computeroutput>;
				2084	this is doubled in size when it overflows. Eventually the array
				2085	is handed back to the caller of
				2086	<computeroutput>VG_(translate)</computeroutput>, who must copy
				2087	the result into TC and TT, and free the array.</para>
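<para>The doubling scheme is the usual one for growable buffers. A
minimal sketch, assuming a hypothetical
<computeroutput>CodeBuf</computeroutput> type and using libc's
<computeroutput>realloc</computeroutput> purely for brevity (Valgrind
itself avoids depending on libc, so the real allocator differs):</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the emitted_code array and its bookkeeping. */
typedef struct {
    unsigned char *bytes;  /* the generated code */
    size_t         used;   /* bytes emitted so far */
    size_t         size;   /* bytes currently allocated */
} CodeBuf;

static void codebuf_init(CodeBuf *cb)
{
    cb->bytes = NULL;
    cb->used  = 0;
    cb->size  = 0;
}

/* Append one byte, doubling the allocation whenever it is full. */
static void codebuf_emit_byte(CodeBuf *cb, unsigned char b)
{
    if (cb->used == cb->size) {
        size_t new_size = cb->size ? cb->size * 2 : 16;
        unsigned char *p = realloc(cb->bytes, new_size);
        assert(p != NULL);  /* a sketch: no graceful out-of-memory path */
        cb->bytes = p;
        cb->size  = new_size;
    }
    cb->bytes[cb->used++] = b;
}

/* The caller copies the result elsewhere, then releases the array. */
static void codebuf_free(CodeBuf *cb)
{
    free(cb->bytes);
    codebuf_init(cb);
}
```
]]></programlisting>
<para>Doubling keeps the total copying cost linear in the final code
size, which matters little for short translations but costs nothing to
get right.</para>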
				2088
				2089	<para>This file is structured into four layers of abstraction,
				2090	which, thankfully, are glued back together with extensive
				2091	<computeroutput>__inline__</computeroutput> directives. From the
				2092	bottom upwards:</para>
				2093
				2094	<itemizedlist>
				2095
				2096	<listitem>
				2097	<para>Address-mode emitters,
				2098	<computeroutput>emit_amode_regmem_reg</computeroutput> et
				2099	al.</para>
				2100	</listitem>
				2101
				2102	<listitem>
				2103	<para>Emitters for specific x86 instructions. There are
				2104	quite a lot of these, with names such as
				2105	<computeroutput>emit_movv_offregmem_reg</computeroutput>.
				2106	The <computeroutput>v</computeroutput> suffix is Intel
				2107	parlance for a 16/32 bit insn; there are also
				2108	<computeroutput>b</computeroutput> suffixes for 8 bit
				2109	insns.</para>
				2110	</listitem>
				2111
				2112	<listitem>
				2113	<para>The next level up are the
				2114	<computeroutput>synth_*</computeroutput> functions, which
				2115	synthesise possibly a sequence of raw x86 instructions to do
				2116	some simple task. Some of these are quite complex because
				2117	they have to work around Intel's silly restrictions on
				2118	subregister naming. See
				2119	<computeroutput>synth_nonshiftop_reg_reg</computeroutput> for
				2120	example.</para>
				2121	</listitem>
				2122
				2123	<listitem>
				2124	<para>Finally, at the top of the heap, we have
				2125	<computeroutput>emitUInstr()</computeroutput>, which emits
				2126	code for a single uinstr.</para>
				2127	</listitem>
				2128
				2129	</itemizedlist>
				2130
				2131	<para>Some comments:</para>
				2132
				2133	<itemizedlist>
				2134
				2135	<listitem>
				2136	<para>The hack for FPU instructions becomes apparent here.
				2137	To do a <computeroutput>FPU</computeroutput> ucode
				2138	instruction, we load the simulated FPU's state from
				2139	<computeroutput>VG_(baseBlock)</computeroutput> into the real
				2140	FPU using an x86 <computeroutput>frstor</computeroutput>
				2141	insn, do the ucode <computeroutput>FPU</computeroutput> insn
				2142	on the real CPU, and write the updated FPU state back into
				2143	<computeroutput>VG_(baseBlock)</computeroutput> using an
				2144	<computeroutput>fnsave</computeroutput> instruction. This is
				2145	pretty brutal, but is simple and it works, and even seems
				2146	tolerably efficient. There is no attempt to cache the
				2147	simulated FPU state in the real FPU over multiple
				2148	back-to-back ucode FPU instructions.</para>
				2149
				2150	<para><computeroutput>FPU_R</computeroutput> and
				2151	<computeroutput>FPU_W</computeroutput> are also done this
				2152	way, with the minor complication that we need to patch in
				2153	some addressing mode bits so the resulting insn knows the
				2154	effective address to use. This is easy because of the
				2155	regularity of the x86 FPU instruction encodings.</para>
				2156	</listitem>
				2157
				2158	<listitem>
				2159	<para>An analogous trick is done with ucode insns which
				2160	claim, in their <computeroutput>flags_r</computeroutput> and
				2161	<computeroutput>flags_w</computeroutput> fields, that they
				2162	read or write the simulated
				2163	<computeroutput>%EFLAGS</computeroutput>. For such cases we
				2164	first copy the simulated
				2165	<computeroutput>%EFLAGS</computeroutput> into the real
				2166	<computeroutput>%eflags</computeroutput>, then do the insn,
				2167	then, if the insn says it writes the flags, copy back to
				2168	<computeroutput>%EFLAGS</computeroutput>. This is a bit
				2169	expensive, which is why the ucode optimisation pass goes to
				2170	some effort to remove redundant flag-update annotations.</para>
				2171	</listitem>
				2172
				2173	</itemizedlist>
				2174
				2175	<para>And so ... that's the end of the documentation for the
				2176	instrumenting translator! It's really not that complex,
				2177	because it's composed as a sequence of simple(ish) self-contained
				2178	transformations on straight-line blocks of code.</para>
				2179
				2180	</sect2>
				2181
				2182
				2183
				2184	<sect2 id="mc-tech-docs.dispatch" xreflabel="Top-level dispatch loop">
				2185	<title>Top-level dispatch loop</title>
				2186
				2187	<para>Urk. In <computeroutput>VG_(toploop)</computeroutput>.
				2188	This is basically boring and unsurprising, not to mention fiddly
				2189	and fragile. It needs to be cleaned up.</para>
				2190
				2191	<para>Perhaps the only surprise is that the whole thing is run on
				2192	top of a <computeroutput>setjmp</computeroutput>-installed
				2193	exception handler, because, supposing a translation got a
				2194	segfault, we have to bail out of the Valgrind-supplied exception
				2195	handler <computeroutput>VG_(oursignalhandler)</computeroutput>
				2196	and immediately start running the client's segfault handler, if
				2197	it has one. In particular we can't finish the current basic
				2198	block and then deliver the signal at some convenient future
				2199	point, because signals like SIGILL, SIGSEGV and SIGBUS mean that
				2200	the faulting insn should not simply be re-tried. (I'm sure there
				2201	is a clearer way to explain this).</para>
				2202
				2203	</sect2>
				2204
				2205
				2206
				2207	<sect2 id="mc-tech-docs.lazy"
				2208	xreflabel="Lazy updates of the simulated program counter">
				2209	<title>Lazy updates of the simulated program counter</title>
				2210
				2211	<para>Simulated <computeroutput>%EIP</computeroutput> is not
				2212	updated after every simulated x86 insn as this was regarded as
				2213	too expensive. Instead ucode
				2214	<computeroutput>INCEIP</computeroutput> insns move it along as
				2215	and when necessary. Currently we don't allow it to fall more
				2216	than 4 bytes behind reality (see
				2217	<computeroutput>VG_(disBB)</computeroutput> for the way this
				2218	works).</para>
				2219
				2220	<para>Note that <computeroutput>%EIP</computeroutput> is always
				2221	brought up to date by the inner dispatch loop in
				2222	<computeroutput>VG_(dispatch)</computeroutput>, so that if the
				2223	client takes a fault we know at least which basic block this
				2224	happened in.</para>
				2225
				2226	</sect2>
				2227
				2228
				2229
				2230	<sect2 id="mc-tech-docs.signals" xreflabel="Signals">
				2231	<title>Signals</title>
				2232
				2233	<para>Horrible, horrible. <filename>vg_signals.c</filename>.
				2234	Basically, since we have to intercept all system calls anyway, we
				2235	can see when the client tries to install a signal handler. If it
				2236	does so, we make a note of what the client asked to happen, and
				2237	ask the kernel to route the signal to our own signal handler,
				2238	<computeroutput>VG_(oursignalhandler)</computeroutput>. This
				2239	simply notes the delivery of signals, and returns.</para>
				2240
				2241	<para>Every 1000 basic blocks, we see if more signals have
				2242	arrived. If so,
				2243	<computeroutput>VG_(deliver_signals)</computeroutput> builds
				2244	signal delivery frames on the client's stack, and allows their
				2245	handlers to be run. Valgrind places in these signal delivery
				2246	frames a bogus return address,
				2247	<computeroutput>VG_(signalreturn_bogusRA)</computeroutput>, and
				2248	checks all jumps to see if any jump to it. Such a jump is a sign
				2249	that a signal handler is returning, in which case Valgrind removes
				2250	the relevant signal frame from the client's stack, restores from
				2251	that frame the simulated state as it was before the signal was
				2252	delivered, and allows the client to run onwards. We have to do
				2253	it this way because some signal handlers never return, they just
				2254	<computeroutput>longjmp()</computeroutput>, which nukes the
				2255	signal delivery frame.</para>
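<para>The note-now, deliver-later arrangement can be sketched as
follows. This is a toy model, not
<filename>vg_signals.c</filename>: the
<computeroutput>toy_*</computeroutput> names are invented, "delivery"
is reduced to bumping a counter, and the handler is called
synchronously, so none of the care a real asynchronous handler needs
is shown.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>

enum { TOY_NSIG = 64, TOY_CHECK_INTERVAL = 1000 };

static bool          toy_pending[TOY_NSIG];  /* noted, not yet delivered */
static int           toy_delivered_count = 0;
static unsigned long toy_bbs_done = 0;

/* Stand-in for the handler the kernel is asked to call: it only
   records that the signal arrived, then returns. */
static void toy_note_signal(int signo)
{
    assert(signo > 0 && signo < TOY_NSIG);
    toy_pending[signo] = true;
}

/* Stand-in for delivery: the real thing would build a frame on the
   client's stack for each pending signal; here we just count them. */
static void toy_deliver_signals(void)
{
    for (int s = 1; s < TOY_NSIG; s++) {
        if (toy_pending[s]) {
            toy_pending[s] = false;
            toy_delivered_count++;
        }
    }
}

/* Called by the dispatcher after each basic block. */
static void toy_after_basic_block(void)
{
    toy_bbs_done++;
    if (toy_bbs_done % TOY_CHECK_INTERVAL == 0)
        toy_deliver_signals();
}
```
]]></programlisting>
<para>A signal noted part-way through an interval therefore waits
until the next multiple of 1000 basic blocks before the client sees
it.</para>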
				2256
				2257	<para>The Linux kernel has a different but equally horrible hack
				2258	for detecting signal handler returns. Discovering it is left as
				2259	an exercise for the reader.</para>
				2260
				2261	</sect2>
				2262
				2263
				2264	<sect2 id="mc-tech-docs.todo">
				2265	<title>To be written</title>
				2266
				2267	<para>The following is a list of as-yet-not-written stuff. Apologies.</para>
				2268	<orderedlist>
				2269	<listitem>
				2270	<para>The translation cache and translation table</para>
				2271	</listitem>
				2272	<listitem>
				2273	<para>Exceptions, creating new translations</para>
				2274	</listitem>
				2275	<listitem>
				2276	<para>Self-modifying code</para>
				2277	</listitem>
				2278	<listitem>
				2279	<para>Errors, error contexts, error reporting, suppressions</para>
				2280	</listitem>
				2281	<listitem>
				2282	<para>Client malloc/free</para>
				2283	</listitem>
				2284	<listitem>
				2285	<para>Low-level memory management</para>
				2286	</listitem>
				2287	<listitem>
				2288	<para>A and V bitmaps</para>
				2289	</listitem>
				2290	<listitem>
				2291	<para>Symbol table management</para>
				2292	</listitem>
				2293	<listitem>
				2294	<para>Dealing with system calls</para>
				2295	</listitem>
				2296	<listitem>
				2297	<para>Namespace management</para>
				2298	</listitem>
				2299	<listitem>
				2300	<para>GDB attaching</para>
				2301	</listitem>
				2302	<listitem>
				2303	<para>Non-dependence on glibc or anything else</para>
				2304	</listitem>
				2305	<listitem>
				2306	<para>The leak detector</para>
				2307	</listitem>
				2308	<listitem>
				2309	<para>Performance problems</para>
				2310	</listitem>
				2311	<listitem>
				2312	<para>Continuous sanity checking</para>
				2313	</listitem>
				2314	<listitem>
				2315	<para>Tracing, or not tracing, child processes</para>
				2316	</listitem>
				2317	<listitem>
				2318	<para>Assembly glue for syscalls</para>
				2319	</listitem>
				2320	</orderedlist>
				2321
				2322	</sect2>
				2323
				2324	</sect1>
				2325
				2326
				2327
				2328
				2329	<sect1 id="mc-tech-docs.extensions" xreflabel="Extensions">
				2330	<title>Extensions</title>
				2331
				2332	<para>Some comments about Stuff To Do.</para>
				2333
				2334	<sect2 id="mc-tech-docs.bugs" xreflabel="Bugs">
				2335	<title>Bugs</title>
				2336
				2337	<para>Stephan Kulow and Marc Mutz report problems with kmail in
				2338	KDE 3 CVS (RC2 ish) when run on Valgrind. Stephan has it
				2339	deadlocking; Marc has it looping at startup. I can't repro
				2340	either behaviour. Needs repro-ing and fixing.</para>
				2341
				2342	</sect2>
				2343
				2344
				2345	<sect2 id="mc-tech-docs.threads" xreflabel="Threads">
				2346	<title>Threads</title>
				2347
				2348	<para>Doing a good job of thread support strikes me as almost a
				2349	research-level problem. The central issues are how to do fast
				2350	cheap locking of the
				2351	<computeroutput>VG_(primary_map)</computeroutput> structure,
				2352	whether or not accesses to the individual secondary maps need
				2353	locking, what race-condition issues result, and whether the
				2354	already-nasty mess that is the signal simulator needs further
				2355	hackery.</para>
				2356
				2357	<para>I realise that threads are the most-frequently-requested
				2358	feature, and I am thinking about it all. If you have guru-level
				2359	understanding of fast mutual exclusion mechanisms and race
				2360	conditions, I would be interested in hearing from you.</para>
				2361
				2362	</sect2>
				2363
				2364
				2365
				2366	<sect2 id="mc-tech-docs.verify" xreflabel="Verification suite">
				2367	<title>Verification suite</title>
				2368
				2369	<para>Directory <computeroutput>tests/</computeroutput> contains
				2370	various ad-hoc tests for Valgrind. However, there is no
				2371	systematic verification or regression suite, that, for example,
				2372	exercises all the stuff in <filename>vg_memory.c</filename>, to
				2373	ensure that illegal memory accesses and undefined value uses are
				2374	detected as they should be. It would be good to have such a
				2375	suite.</para>
				2376
				2377	</sect2>
				2378
				2379
				2380	<sect2 id="mc-tech-docs.porting" xreflabel="Porting to other platforms">
				2381	<title>Porting to other platforms</title>
				2382
				2383	<para>It would be great if Valgrind was ported to FreeBSD and x86
				2384	NetBSD, and to x86 OpenBSD, if it's possible (doesn't OpenBSD use
				2385	a.out-style executables, not ELF ?)</para>
				2386
				2387	<para>The main difficulties, for an x86-ELF platform, seem to
				2388	be:</para>
				2389
				2390	<itemizedlist>
				2391
				2392	<listitem>
				2393	<para>You'd need to rewrite the
				2394	<computeroutput>/proc/self/maps</computeroutput> parser
				2395	(<filename>vg_procselfmaps.c</filename>). Easy.</para>
				2396	</listitem>
				2397
				2398	<listitem>
				2399	<para>You'd need to rewrite
				2400	<filename>vg_syscall_mem.c</filename>, or, more specifically,
				2401	provide one for your OS. This is tedious, but you can
				2402	implement syscalls on demand, and the Linux kernel interface
				2403	is, for the most part, going to look very similar to the *BSD
				2404	interfaces, so it's really a copy-paste-and-modify-on-demand
				2405	job. As part of this, you'd need to supply a new
				2406	<filename>vg_kerneliface.h</filename> file.</para>
				2407	</listitem>
				2408
				2409	<listitem>
				2410	<para>You'd also need to change the syscall wrappers for
				2411	Valgrind's internal use, in
				2412	<filename>vg_mylibc.c</filename>.</para>
				2413	</listitem>
				2414
				2415	</itemizedlist>
				2416
				2417	<para>All in all, I think a port to x86-ELF *BSDs is not really
				2418	very difficult, and in some ways I would like to see it happen,
				2419	because that would force a more clear factoring of Valgrind into
				2420	platform dependent and independent pieces. Not to mention, *BSD
				2421	folks also deserve to use Valgrind just as much as the Linux crew
				2422	do.</para>
				2423
				2424	</sect2>
				2425
				2426	</sect1>
				2427
				2428
				2429
				2430	<sect1 id="mc-tech-docs.easystuff"
				2431	xreflabel="Easy stuff which ought to be done">
				2432	<title>Easy stuff which ought to be done</title>
				2433
				2434
				2435	<sect2 id="mc-tech-docs.mmx" xreflabel="MMX Instructions">
				2436	<title>MMX Instructions</title>
				2437
				2438	<para>MMX insns should be supported, using the same trick as for
				2439	FPU insns. If the MMX registers are not used to copy
				2440	uninitialised junk from one place to another in memory, this
				2441	means we don't have to actually simulate the internal MMX unit
				2442	state, so the FPU hack applies. This should be fairly
				2443	easy.</para>
				2444
				2445	</sect2>
				2446
				2447
				2448	<sect2 id="mc-tech-docs.fixstabs" xreflabel="Fix stabs-info Reader">
				2449	<title>Fix stabs-info reader</title>
				2450
				2451	<para>The machinery in <filename>vg_symtab2.c</filename> which
				2452	reads "stabs" style debugging info is pretty weak. It usually
				2453	correctly translates simulated program counter values into line
				2454	numbers and procedure names, but the file name is often
				2455	completely wrong. I think the logic used to parse "stabs"
				2456	entries is weak. It should be fixed. The simplest solution,
				2457	IMO, is to copy either the logic or simply the code out of GNU
				2458	binutils which does this; since GDB can clearly get it right,
				2459	binutils (or GDB?) must have code to do this somewhere.</para>
				2460
				2461	</sect2>
				2462
				2463
				2464
				2465	<sect2 id="mc-tech-docs.x86instr" xreflabel="BT/BTC/BTS/BTR">
				2466	<title>BT/BTC/BTS/BTR</title>
				2467
				2468	<para>These are x86 instructions which test, complement, set, or
				2469	reset, a single bit in a word. At the moment they are both
				2470	incorrectly implemented and incorrectly instrumented.</para>
				2471
				2472	<para>The incorrect instrumentation is due to use of helper
				2473	functions. This means we lose bit-level definedness tracking,
				2474	which could wind up giving spurious uninitialised-value use
				2475	errors. The Right Thing to do is to invent a couple of new
				2476	UOpcodes, I think <computeroutput>GET_BIT</computeroutput> and
				2477	<computeroutput>SET_BIT</computeroutput>, which can be used to
				2478	implement all 4 x86 insns, get rid of the helpers, and give
				2479	bit-accurate instrumentation rules for the two new
				2480	UOpcodes.</para>
				2481
				2482	<para>I realised the other day that they are mis-implemented too.
				2483	The x86 insns take a bit-index and a register or memory location
				2484	to access. For registers the bit index clearly can only be in
				2485	the range zero to register-width minus 1, and I assumed the same
				2486	applied to memory locations too. But evidently not; for memory
				2487	locations the index can be arbitrary, and the processor will
				2488	index arbitrarily into memory as a result. This too should be
				2489	fixed. Sigh. Presumably indexing outside the immediate word is
				2490	not actually used by any programs yet tested on Valgrind, for
				2491	otherwise they (presumably) would simply not work at all. If you
				2492	plan to hack on this, first check the Intel docs to make sure my
				2493	understanding is really correct.</para>
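<para>For anyone picking this up, here is a small model of the
behaviour as I understand it for the register-offset forms, to be
checked against the Intel docs as noted above. The
<computeroutput>toy_bt_*</computeroutput> names are invented, and the
memory case is modelled at byte granularity rather than in
operand-sized units, which selects the same bit on a little-endian
machine.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Register operand: the bit index wraps modulo the register width. */
static int toy_bt_reg(uint32_t reg, int32_t bit_index)
{
    return (int)((reg >> ((uint32_t)bit_index & 31u)) & 1u);
}

/* Memory operand: the bit index is a signed offset into a bit string
   starting at 'base', so it can reach bytes before or after it. */
static int toy_bt_mem(const uint8_t *base, int32_t bit_index)
{
    int64_t idx = bit_index;
    /* Floor division by 8; C's '/' truncates toward zero instead. */
    int64_t byte_off = (idx >= 0) ? idx / 8 : -((-idx + 7) / 8);
    int     bit      = (int)(idx - byte_off * 8);  /* always 0..7 */

    return (base[byte_off] >> bit) & 1;
}
```
]]></programlisting>
<para>So an index of -1 tests the top bit of the byte just below
<computeroutput>base</computeroutput>, whereas the same index applied
to a register tests bit 31 of that register.</para>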
				2494
				2495	</sect2>
				2496
				2497
				2498	<sect2 id="mc-tech-docs.prefetch" xreflabel="Using PREFETCH Instructions">
				2499	<title>Using PREFETCH Instructions</title>
				2500
				2501	<para>Here's a small but potentially interesting project for
				2502	performance junkies. Experiments with valgrind's code generator
				2503	and optimiser(s) suggest that reducing the number of instructions
				2504	executed in the translations and mem-check helpers gives
				2505	disappointingly small performance improvements. Perhaps this is
				2506	because performance of Valgrindified code is limited by cache
				2507	misses. After all, each read in the original program now gives
				2508	rise to at least three reads, one for the
				2509	<computeroutput>VG_(primary_map)</computeroutput>, one for the
				2510	resulting secondary map, and the original. Not to mention, the
				2511	instrumented translations are 13 to 14 times larger than the
				2512	originals. All in all one would expect the memory system to be
				2513	hammered to hell and then some.</para>
				2514
				2515	<para>So here's an idea. An x86 insn involving a read from
				2516	memory, after instrumentation, will turn into ucode of the
				2517	following form:</para>
				2518	<programlisting><![CDATA[
				2519	... calculate effective addr, into ta and qa ...
				2520	TESTVL qa -- is the addr defined?
				2521	LOADV (ta), qloaded -- fetch V bits for the addr
				2522	LOAD (ta), tloaded -- do the original load]]></programlisting>
				2523
				2524	<para>At the point where the
				2525	<computeroutput>LOADV</computeroutput> is done, we know the
				2526	actual address (<computeroutput>ta</computeroutput>) from which
				2527	the real <computeroutput>LOAD</computeroutput> will be done. We
				2528	also know that the <computeroutput>LOADV</computeroutput> will
				2529	take around 20 x86 insns to do. So it seems plausible that doing
				2530	a prefetch of <computeroutput>ta</computeroutput> just before the
				2531	<computeroutput>LOADV</computeroutput> might just avoid a miss at
				2532	the <computeroutput>LOAD</computeroutput> point, and that might
				2533	be a significant performance win.</para>
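<para>At the C level the same idea looks roughly like this. It is a
sketch only: the shadow lookup is a toy array rather than the real
two-level map, the <computeroutput>toy_*</computeroutput> names are
invented, and it uses gcc's
<computeroutput>__builtin_prefetch</computeroutput> rather than
hand-picked Intel or AMD insns, so it says nothing about which prefetch
flavour would be emitted in a translation.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TOY_SHADOW_BYTES = 256 };

/* Toy shadow memory: one V byte per data byte, 0 meaning "defined". */
static uint8_t toy_vbits[TOY_SHADOW_BYTES];

/* Stand-in for the LOADV helper, which in reality costs many insns. */
static uint8_t toy_loadv(size_t shadow_index)
{
    assert(shadow_index < TOY_SHADOW_BYTES);
    return toy_vbits[shadow_index];
}

/* Prefetch the real address first, then do the slow V-bit fetch, then
   the original load, hoping the line has arrived by then. */
static uint8_t toy_checked_load(const uint8_t *ta, size_t shadow_index,
                                uint8_t *vbits_out)
{
#if defined(__GNUC__)
    __builtin_prefetch(ta, 0 /* for reading */, 3 /* keep in cache */);
#endif
    *vbits_out = toy_loadv(shadow_index);  /* LOADV (ta), qloaded */
    return *ta;                            /* LOAD  (ta), tloaded */
}
```
]]></programlisting>
<para>The prefetch is only a hint, so the function returns the same
values with or without it; any benefit would show up in timings
alone.</para>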
				2534
				2535	<para>Prefetch insns are notoriously temperamental, more often
				2536	than not making things worse rather than better, so this would
				2537	require considerable fiddling around. It's complicated because
				2538	Intels and AMDs have different prefetch insns with different
				2539	semantics, so that too needs to be taken into account. As a
				2540	general rule, even placing the prefetches before the
				2541	<computeroutput>LOADV</computeroutput> insn is too near the
				2542	<computeroutput>LOAD</computeroutput>; the ideal distance is
				2543	apparently circa 200 CPU cycles. So it might be worth having
				2544	another analysis/transformation pass which pushes prefetches as
				2545	far back as possible, hopefully immediately after the effective
				2546	address becomes available.</para>
				2547
				2548	<para>Doing too many prefetches is also bad because they soak up
				2549	bus bandwidth / cpu resources, so some cleverness in deciding
				2550	which loads to prefetch and which to not might be helpful. One
				2551	can imagine not prefetching client-stack-relative
				2552	(<computeroutput>%EBP</computeroutput> or
				2553	<computeroutput>%ESP</computeroutput>) accesses, since the stack
				2554	in general tends to show good locality anyway.</para>
				2555
				2556	<para>There's quite a lot of experimentation to do here, but I
				2557	think it might make an interesting week's work for
				2558	someone.</para>
				2559
				2560	<para>As of 15-ish March 2002, I've started to experiment with
				2561	this, using the AMD
				2562	<computeroutput>prefetch/prefetchw</computeroutput> insns.</para>
				2563
				2564	</sect2>
				2565
				2566
				2567	<sect2 id="mc-tech-docs.pranges" xreflabel="User-defined Permission Ranges">
				2568	<title>User-defined Permission Ranges</title>
				2569
				2570	<para>This is quite a large project -- perhaps a month's hacking
				2571	for a capable hacker to do a good job -- but it's potentially
				2572	very interesting. The outcome would be that Valgrind could
				2573	detect a whole class of bugs which it currently cannot.</para>
				2574
				2575	<para>The presentation falls into two pieces.</para>
				2576
				2577	<sect3 id="mc-tech-docs.psetting"
				2578	xreflabel="Part 1: User-defined Address-range Permission Setting">
				2579	<title>Part 1: User-defined Address-range Permission Setting</title>
				2580
				2581	<para>Valgrind intercepts the client's
				2582	<computeroutput>malloc</computeroutput>,
				2583	<computeroutput>free</computeroutput>, etc calls, watches system
				2584	calls, and watches the stack pointer move. This is currently the
				2585	only way it knows about which addresses are valid and which not.
				2586	Sometimes the client program knows extra information about its
				2587	memory areas. For example, the client could at some point know
				2588	that all elements of an array are out-of-date. We would like to
				2589	be able to convey to Valgrind this information that the array is
				2590	now addressable-but-uninitialised, so that Valgrind can then warn
				2591	if elements are used before they get new values.</para>
				2592
				2593	<para>What I would like are some macros like this:</para>
				2594	<programlisting><![CDATA[
				2595	VALGRIND_MAKE_NOACCESS(addr, len)
				2596	VALGRIND_MAKE_WRITABLE(addr, len)
				2597	VALGRIND_MAKE_READABLE(addr, len)]]></programlisting>
				2598
				2599	<para>and also, to check that memory is
				2600	addressable/initialised,</para>
				2601	<programlisting><![CDATA[
				2602	VALGRIND_CHECK_ADDRESSIBLE(addr, len)
				2603	VALGRIND_CHECK_INITIALISED(addr, len)]]></programlisting>
				2604
				2605	<para>I then include in my sources a header defining these
				2606	macros, rebuild my app, run under Valgrind, and get user-defined
				2607	checks.</para>
				2608
				2609	<para>Now here's a neat trick. It's a nuisance to have to
				2610	re-link the app with some new library which implements the above
				2611	macros. So the idea is to define the macros so that the
				2612	resulting executable is still completely stand-alone, and can be
				2613	run without Valgrind, in which case the macros do nothing, but
				2614	when run on Valgrind, the Right Thing happens. How to do this?
				2615	The idea is for these macros to turn into a piece of inline
				2616	assembly code, which (1) has no effect when run on the real CPU,
				2617	(2) is easily spotted by Valgrind's JITter, and (3) no sane
				2618	person would ever write, which is important for avoiding false
				2619	matches in (2). So here's a suggestion:</para>
				2620	<programlisting><![CDATA[
				2621	VALGRIND_MAKE_NOACCESS(addr, len)]]></programlisting>
				2622
				2623	<para>becomes (roughly speaking)</para>
				2624	<programlisting><![CDATA[
				2625	movl addr, %eax
				2626	movl len, %ebx
				2627	movl $1, %ecx -- 1 describes the action; MAKE_WRITABLE might be
				2628	-- 2, etc
				2629	rorl $13, %ecx
				2630	rorl $19, %ecx
				2631	rorl $11, %eax
				2632	rorl $21, %eax]]></programlisting>
				2633
				2634	<para>The rotate sequences have no effect, and it's unlikely they
				2635	would appear for any other reason, but they define a unique
				2636	byte-sequence which the JITter can easily spot. Using the
				2637	operand constraints section at the end of a gcc inline-assembly
				2638	statement, we can tell gcc that the assembly fragment kills
				2639	<computeroutput>%eax</computeroutput>,
				2640	<computeroutput>%ebx</computeroutput>,
				2641	<computeroutput>%ecx</computeroutput> and the condition codes, so
				2642	this fragment is made harmless when not running on Valgrind, runs
				2643	quickly when not on Valgrind, and does not require any other
				2644	library support.</para>
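<para>Both halves of the trick can be checked in a few lines of C. The
sketch below assumes the assembler encodes each
<computeroutput>rorl $imm, %reg</computeroutput> in the three-byte
<computeroutput>C1 /1 ib</computeroutput> form; the
<computeroutput>toy_*</computeroutput> names are invented, and a real
JITter would match the pattern while decoding insns rather than by
scanning raw bytes.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rotate a 32-bit value right by n bits, as rorl does. */
static uint32_t toy_ror32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x >> n) | (x << (32u - n)) : x;
}

/* The four rotates from the text, assuming the C1 /1 ib encoding. */
static const uint8_t toy_magic[] = {
    0xC1, 0xC9, 13,   /* rorl $13, %ecx */
    0xC1, 0xC9, 19,   /* rorl $19, %ecx */
    0xC1, 0xC8, 11,   /* rorl $11, %eax */
    0xC1, 0xC8, 21    /* rorl $21, %eax */
};

/* Return the offset of the first occurrence of the magic sequence in
   code[0 .. len-1], or -1 if it does not occur. */
static long toy_find_magic(const uint8_t *code, size_t len)
{
    size_t n = sizeof toy_magic;

    for (size_t i = 0; i + n <= len; i++) {
        if (memcmp(code + i, toy_magic, n) == 0)
            return (long)i;
    }
    return -1;
}
```
]]></programlisting>
<para>Each pair of rotate counts sums to 32, which is why the values
in <computeroutput>%eax</computeroutput> and
<computeroutput>%ecx</computeroutput> come out unchanged on a real
CPU.</para>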
				2645
				2646
				2647	</sect3>
				2648
				2649
				2650	<sect3 id="mc-tech-docs.prange-detect"
				2651	xreflabel="Part 2: Using it to detect Interference between Stack
				2652	Variables">
				2653	<title>Part 2: Using it to detect Interference between Stack
				2654	Variables</title>
				2655
				2656	<para>Currently Valgrind cannot detect errors of the following
				2657	form:</para>
				2658	<programlisting><![CDATA[
				2659	void fooble ( void )
				2660	{
				2661	int a[10];
				2662	int b[10];
				2663	a[10] = 99;
				2664	}]]></programlisting>
				2665
				2666	<para>Now imagine rewriting this as</para>
				2667	<programlisting><![CDATA[
				2668	void fooble ( void )
				2669	{
				2670	int spacer0;
				2671	int a[10];
				2672	int spacer1;
				2673	int b[10];
				2674	int spacer2;
				2675	VALGRIND_MAKE_NOACCESS(&spacer0, sizeof(int));
				2676	VALGRIND_MAKE_NOACCESS(&spacer1, sizeof(int));
				2677	VALGRIND_MAKE_NOACCESS(&spacer2, sizeof(int));
				2678	a[10] = 99;
				2679	}]]></programlisting>
				2680
				2681	<para>Now the invalid write is certain to hit
				2682	<computeroutput>spacer0</computeroutput> or
				2683	<computeroutput>spacer1</computeroutput>, so Valgrind will spot
				2684	the error.</para>
				2685
				2686	<para>There are two complications.</para>
				2687
				2688	<orderedlist>
				2689
				2690	<listitem>
				2691	<para>The first is that we don't want to annotate sources by
				2692	hand, so the Right Thing to do is to write a C/C++ parser,
				2693	annotator, prettyprinter which does this automatically, and
				2694	run it on post-CPP'd C/C++ source. See
				2695	http://www.cacheprof.org for an example of a system which
				2696	transparently inserts another phase into the gcc/g++
				2697	compilation route. The parser/prettyprinter is probably not
				2698	as hard as it sounds; I would write it in Haskell, a powerful
				2699	functional language well suited to doing symbolic
				2700	computation, with which I am intimately familiar. There is
				2701	already a C parser written in Haskell by someone in the
				2702	Haskell community, and that would probably be a good starting
				2703	point.</para>
				2704	</listitem>
				2705
				2706
				2707	<listitem>
				2708	<para>The second complication is how to get rid of these
				2709	<computeroutput>NOACCESS</computeroutput> records inside
				2710	Valgrind when the instrumented function exits; after all,
				2711	these refer to stack addresses and will make no sense
				2712	whatever when some other function happens to re-use the same
				2713	stack address range, probably shortly afterwards. I think I
				2714	would be inclined to define a special stack-specific
				2715	macro:</para>
				2716	<programlisting><![CDATA[
				2717	VALGRIND_MAKE_NOACCESS_STACK(addr, len)]]></programlisting>
				2718	<para>which causes Valgrind to record the client's
				2719	<computeroutput>%ESP</computeroutput> at the time it is
				2720	executed. Valgrind will then watch for changes in
				2721	<computeroutput>%ESP</computeroutput> and discard such
				2722	records as soon as the protected area is uncovered by an
				2723	increase in <computeroutput>%ESP</computeroutput>. I
				2724	hesitate with this scheme only because it is potentially
				2725	expensive, if there are hundreds of such records, and
				2726	considering that changes in
				2727	<computeroutput>%ESP</computeroutput> already require
				2728	expensive messing with stack access permissions.</para>
				2729	</listitem>
				2730	</orderedlist>
				2731
				2732	<para>This is probably easier and more robust than for the
				2733	instrumenter program to try and spot all exit points for the
				2734	procedure and place suitable deallocation annotations there.
				2735	Plus C++ procedures can bomb out at any point if they get an
				2736	exception, so spotting return points at the source level just
				2737	won't work at all.</para>
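<para>The <computeroutput>%ESP</computeroutput>-watching scheme might
be sketched like this. It is a guess at a design that does not exist
yet: the <computeroutput>toy_*</computeroutput> names and the
fixed-size table are invented, and a linear scan on every stack
pointer change is exactly the cost I worry about above.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One protected stack range, as recorded by the hypothetical
   VALGRIND_MAKE_NOACCESS_STACK request. */
typedef struct {
    uintptr_t addr;
    size_t    len;
} ToyStackRecord;

enum { TOY_MAX_RECORDS = 64 };

static ToyStackRecord toy_records[TOY_MAX_RECORDS];
static size_t         toy_n_records = 0;

static void toy_make_noaccess_stack(uintptr_t addr, size_t len)
{
    assert(toy_n_records < TOY_MAX_RECORDS);
    toy_records[toy_n_records].addr = addr;
    toy_records[toy_n_records].len  = len;
    toy_n_records++;
}

/* Called whenever the simulated %ESP changes.  The stack grows
   downwards, so a record starting below the new %ESP has been
   uncovered and is discarded.  Returns the number discarded. */
static size_t toy_esp_changed(uintptr_t new_esp)
{
    size_t kept = 0, dropped = 0;

    for (size_t i = 0; i < toy_n_records; i++) {
        if (toy_records[i].addr < new_esp)
            dropped++;
        else
            toy_records[kept++] = toy_records[i];
    }
    toy_n_records = kept;
    return dropped;
}
```
]]></programlisting>
<para>Keeping the records sorted by address would let the scan stop
early, at the price of a slower insert.</para>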
				2738
				2739	<para>Although some work, it's all eminently doable, and it would
				2740	make Valgrind into an even-more-useful tool.</para>
				2741
				2742	</sect3>
				2743
				2744	</sect2>
				2745
				2746	</sect1>
				2747	</chapter>