Blame - memcheck/docs/mc-tech-docs.xml - fp2-dev/platform/external/valgrind


<?xml version="1.0"?> <!-- -*- sgml -*- -->
				2	<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
				3	"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
				4
				5	<chapter id="mc-tech-docs"
				6	xreflabel="The design and implementation of Valgrind">
				7
				8	<title>The Design and Implementation of Valgrind</title>
				9	<subtitle>Detailed technical notes for hackers, maintainers and
				10	the overly-curious</subtitle>
				11
				12	<sect1 id="mc-tech-docs.intro" xreflabel="Introduction">
				13	<title>Introduction</title>
				14
				15	<para>This document contains a detailed, highly-technical
				16	description of the internals of Valgrind. This is not the user
				17	manual; if you are an end-user of Valgrind, you do not want to
				18	read this. Conversely, if you really are a hacker-type and want
				19	to know how it works, I assume that you have read the user manual
				20	thoroughly.</para>
				21
				22	<para>You may need to read this document several times, and
				23	carefully. Some important things, I only say once.</para>
				24
<para>[Note: this document is now very old, and a lot of its contents are out
of date, and misleading.]</para>
				28
				29	<sect2 id="mc-tech-docs.history" xreflabel="History">
				30	<title>History</title>
				31
				32	<para>Valgrind came into public view in late Feb 2002. However,
				33	it has been under contemplation for a very long time, perhaps
				34	seriously for about five years. Somewhat over two years ago, I
				35	started working on the x86 code generator for the Glasgow Haskell
				36	Compiler (http://www.haskell.org/ghc), gaining familiarity with
x86 internals on the way. I then did Cacheprof,
gaining further x86 experience. Some
time around Feb 2000 I started experimenting with a user-space
				40	x86 interpreter for x86-Linux. This worked, but it was clear
				41	that a JIT-based scheme would be necessary to give reasonable
				42	performance for Valgrind. Design work for the JITter started in
				43	earnest in Oct 2000, and by early 2001 I had an x86-to-x86
				44	dynamic translator which could run quite large programs. This
				45	translator was in a sense pointless, since it did not do any
				46	instrumentation or checking.</para>
				47
				48	<para>Most of the rest of 2001 was taken up designing and
				49	implementing the instrumentation scheme. The main difficulty,
				50	which consumed a lot of effort, was to design a scheme which did
				51	not generate large numbers of false uninitialised-value warnings.
				52	By late 2001 a satisfactory scheme had been arrived at, and I
				53	started to test it on ever-larger programs, with an eventual eye
				54	to making it work well enough so that it was helpful to folks
				55	debugging the upcoming version 3 of KDE. I've used KDE since
before version 1.0, and wanted Valgrind to be an indirect
				57	contribution to the KDE 3 development effort. At the start of
				58	Feb 02 the kde-core-devel crew started using it, and gave a huge
				59	amount of helpful feedback and patches in the space of three
				60	weeks. Snapshot 20020306 is the result.</para>
				61
				62	<para>In the best Unix tradition, or perhaps in the spirit of
Fred Brooks' depressing-but-completely-accurate epigram "build
				64	one to throw away; you will anyway", much of Valgrind is a second
				65	or third rendition of the initial idea. The instrumentation
				66	machinery (<filename>vg_translate.c</filename>,
				67	<filename>vg_memory.c</filename>) and core CPU simulation
				68	(<filename>vg_to_ucode.c</filename>,
				69	<filename>vg_from_ucode.c</filename>) have had three redesigns
				70	and rewrites; the register allocator, low-level memory manager
				71	(<filename>vg_malloc2.c</filename>) and symbol table reader
				72	(<filename>vg_symtab2.c</filename>) are on the second rewrite.
				73	In a sense, this document serves to record some of the knowledge
				74	gained as a result.</para>
				75
				76	</sect2>
				77
				78
				79	<sect2 id="mc-tech-docs.overview" xreflabel="Design overview">
				80	<title>Design overview</title>
				81
				82	<para>Valgrind is compiled into a Linux shared object,
				83	<filename>valgrind.so</filename>, and also a dummy one,
				84	<filename>valgrinq.so</filename>, of which more later. The
				85	<filename>valgrind</filename> shell script adds
				86	<filename>valgrind.so</filename> to the
				87	<computeroutput>LD_PRELOAD</computeroutput> list of extra
libraries to be loaded with any dynamically linked executable. This
				89	is a standard trick, one which I assume the
				90	<computeroutput>LD_PRELOAD</computeroutput> mechanism was
				91	developed to support.</para>
				92
				93	<para><filename>valgrind.so</filename> is linked with the
				94	<computeroutput>-z initfirst</computeroutput> flag, which
				95	requests that its initialisation code is run before that of any
				96	other object in the executable image. When this happens,
				97	valgrind gains control. The real CPU becomes "trapped" in
				98	<filename>valgrind.so</filename> and the translations it
				99	generates. The synthetic CPU provided by Valgrind does, however,
				100	return from this initialisation function. So the normal startup
				101	actions, orchestrated by the dynamic linker
				102	<filename>ld.so</filename>, continue as usual, except on the
				103	synthetic CPU, not the real one. Eventually
				104	<computeroutput>main</computeroutput> is run and returns, and
				105	then the finalisation code of the shared objects is run,
presumably in the reverse of the order in which they were initialised.
				107	Remember, this is still all happening on the simulated CPU.
				108	Eventually <filename>valgrind.so</filename>'s own finalisation
				109	code is called. It spots this event, shuts down the simulated
				110	CPU, prints any error summaries and/or does leak detection, and
				111	returns from the initialisation code on the real CPU. At this
				112	point, in effect the real and synthetic CPUs have merged back
				113	into one, Valgrind has lost control of the program, and the
				114	program finally <computeroutput>exit()s</computeroutput> back to
				115	the kernel in the usual way.</para>
				116
				117	<para>The normal course of activity, once Valgrind has started
				118	up, is as follows. Valgrind never runs any part of your program
				119	(usually referred to as the "client"), not a single byte of it,
				120	directly. Instead it uses function
				121	<computeroutput>VG_(translate)</computeroutput> to translate
				122	basic blocks (BBs, straight-line sequences of code) into
				123	instrumented translations, and those are run instead. The
				124	translations are stored in the translation cache (TC),
				125	<computeroutput>vg_tc</computeroutput>, with the translation
				126	table (TT), <computeroutput>vg_tt</computeroutput> supplying the
				127	original-to-translation code address mapping. Auxiliary array
				128	<computeroutput>VG_(tt_fast)</computeroutput> is used as a
				129	direct-map cache for fast lookups in TT; it usually achieves a
				130	hit rate of around 98% and facilitates an orig-to-trans lookup in
				131	4 x86 insns, which is not bad.</para>
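<para>The direct-map idea can be sketched in C as follows. This is
a minimal illustration, not Valgrind's actual code: the table size and
the names <computeroutput>FastEntry</computeroutput>,
<computeroutput>fast_insert</computeroutput> and
<computeroutput>fast_lookup</computeroutput> are assumptions made for
the example. Each original address maps to exactly one slot, so a
lookup is an index computation plus one comparison, and a miss simply
falls through to the slower search of TT.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size; the real VG_(tt_fast) layout may differ. */
#define FAST_BITS 10u
#define FAST_SIZE (1u << FAST_BITS)

typedef struct {
    uint32_t orig_addr;   /* address of the original basic block */
    void    *trans_addr;  /* address of its translation; NULL if slot empty */
} FastEntry;

static FastEntry fast_cache[FAST_SIZE];

/* Direct mapping: the low bits of the address select the only candidate slot. */
static unsigned fast_index(uint32_t orig_addr)
{
    return orig_addr & (FAST_SIZE - 1u);
}

/* Record a translation, evicting whatever previously occupied the slot. */
void fast_insert(uint32_t orig_addr, void *trans_addr)
{
    assert(trans_addr != NULL);
    FastEntry *e = &fast_cache[fast_index(orig_addr)];
    e->orig_addr  = orig_addr;
    e->trans_addr = trans_addr;
}

/* Returns the translation address, or NULL on a miss (caller then
   falls back to the full translation-table search). */
void *fast_lookup(uint32_t orig_addr)
{
    const FastEntry *e = &fast_cache[fast_index(orig_addr)];
    if (e->trans_addr != NULL && e->orig_addr == orig_addr)
        return e->trans_addr;
    return NULL;
}
```
]]></programlisting>

<para>Two addresses that share the same low bits compete for one
slot, which is why such a cache cannot reach a 100% hit rate even
when every block has already been translated.</para>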
				132
				133	<para>Function <computeroutput>VG_(dispatch)</computeroutput> in
				134	<filename>vg_dispatch.S</filename> is the heart of the JIT
				135	dispatcher. Once a translated code address has been found, it is
				136	executed simply by an x86 <computeroutput>call</computeroutput>
				137	to the translation. At the end of the translation, the next
				138	original code addr is loaded into
				139	<computeroutput>%eax</computeroutput>, and the translation then
				140	does a <computeroutput>ret</computeroutput>, taking it back to
				141	the dispatch loop, with, interestingly, zero branch
				142	mispredictions. The address requested in
				143	<computeroutput>%eax</computeroutput> is looked up first in
				144	<computeroutput>VG_(tt_fast)</computeroutput>, and, if not found,
				145	by calling C helper
				146	<computeroutput>VG_(search_transtab)</computeroutput>. If there
				147	is still no translation available,
				148	<computeroutput>VG_(dispatch)</computeroutput> exits back to the
				149	top-level C dispatcher
				150	<computeroutput>VG_(toploop)</computeroutput>, which arranges for
				151	<computeroutput>VG_(translate)</computeroutput> to make a new
				152	translation. All fairly unsurprising, really. There are various
				153	complexities described below.</para>
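<para>The control flow just described can be modelled in a few lines
of C. This is a toy sketch under stated assumptions: a "translation"
is an ordinary C function whose return value stands in for the next
original address left in <computeroutput>%eax</computeroutput>, the
client has only four blocks, and
<computeroutput>translate</computeroutput>,
<computeroutput>run_client</computeroutput> and
<computeroutput>EXIT_ADDR</computeroutput> are hypothetical names.
The real dispatcher is written in assembly and is considerably more
involved.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A toy translation: returns the next original address to run. */
typedef uint32_t (*Translation)(void);

#define N_BLOCKS  4u           /* toy client: blocks at addresses 0..3 */
#define EXIT_ADDR 0xFFFFFFFFu  /* hypothetical "client has finished" marker */

static uint32_t bb0(void) { return 1; }
static uint32_t bb1(void) { return 2; }
static uint32_t bb2(void) { return 3; }
static uint32_t bb3(void) { return EXIT_ADDR; }

static Translation table[N_BLOCKS];  /* stands in for TT; NULL = untranslated */
static unsigned n_translated;

/* Stand-in for the translator: produces a translation on demand. */
static Translation translate(uint32_t orig)
{
    static const Translation all[N_BLOCKS] = { bb0, bb1, bb2, bb3 };
    assert(orig < N_BLOCKS);
    n_translated++;
    return all[orig];
}

/* Dispatch loop: look up, translate on a miss, run, repeat.
   Returns the number of blocks executed. */
unsigned run_client(uint32_t start)
{
    unsigned executed = 0;
    uint32_t next = start;
    while (next != EXIT_ADDR) {
        assert(next < N_BLOCKS);
        if (table[next] == NULL)
            table[next] = translate(next);
        next = table[next]();
        executed++;
    }
    return executed;
}

unsigned translations_made(void) { return n_translated; }
```
]]></programlisting>

<para>Running the toy client a second time makes no new
translations, which is the whole point of keeping them in a
cache.</para>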
				154
				155	<para>The translator, orchestrated by
				156	<computeroutput>VG_(translate)</computeroutput>, is complicated
				157	but entirely self-contained. It is described in great detail in
				158	subsequent sections. Translations are stored in TC, with TT
				159	tracking administrative information. The translations are
				160	subject to an approximate LRU-based management scheme. With the
				161	current settings, the TC can hold at most about 15MB of
				162	translations, and LRU passes prune it to about 13.5MB. Given
				163	that the orig-to-translation expansion ratio is about 13:1 to
				164	14:1, this means TC holds translations for more or less a
				165	megabyte of original code, which generally comes to about 70000
				166	basic blocks for C++ compiled with optimisation on. Generating
				167	new translations is expensive, so it is worth having a large TC
				168	to minimise the (capacity) miss rate.</para>
				169
				170	<para>The dispatcher,
				171	<computeroutput>VG_(dispatch)</computeroutput>, receives hints
				172	from the translations which allow it to cheaply spot all control
				173	transfers corresponding to x86
				174	<computeroutput>call</computeroutput> and
				175	<computeroutput>ret</computeroutput> instructions. It has to do
				176	this in order to spot some special events:</para>
				177
				178	<itemizedlist>
				179	<listitem>
				180	<para>Calls to
				181	<computeroutput>VG_(shutdown)</computeroutput>. This is
				182	Valgrind's cue to exit. NOTE: actually this is done a
				183	different way; it should be cleaned up.</para>
				184	</listitem>
				185
				186	<listitem>
<para>Returns of signal handlers, to the return address
				188	<computeroutput>VG_(signalreturn_bogusRA)</computeroutput>.
				189	The signal simulator needs to know when a signal handler is
				190	returning, so we spot jumps (returns) to this address.</para>
				191	</listitem>
				192
				193	<listitem>
				194	<para>Calls to <computeroutput>vg_trap_here</computeroutput>.
				195	All <computeroutput>malloc</computeroutput>,
				196	<computeroutput>free</computeroutput>, etc calls that the
				197	client program makes are eventually routed to a call to
				198	<computeroutput>vg_trap_here</computeroutput>, and Valgrind
				199	does its own special thing with these calls. In effect this
				200	provides a trapdoor, by which Valgrind can intercept certain
				201	calls on the simulated CPU, run the call as it sees fit
				202	itself (on the real CPU), and return the result to the
				203	simulated CPU, quite transparently to the client
				204	program.</para>
				205	</listitem>
				206
				207	</itemizedlist>
				208
				209	<para>Valgrind intercepts the client's
				210	<computeroutput>malloc</computeroutput>,
				211	<computeroutput>free</computeroutput>, etc, calls, so that it can
				212	store additional information. Each block
				213	<computeroutput>malloc</computeroutput>'d by the client gives
				214	rise to a shadow block in which Valgrind stores the call stack at
				215	the time of the <computeroutput>malloc</computeroutput> call.
				216	When the client calls <computeroutput>free</computeroutput>,
				217	Valgrind tries to find the shadow block corresponding to the
				218	address passed to <computeroutput>free</computeroutput>, and
				219	emits an error message if none can be found. If it is found, the
				220	block is placed on the freed blocks queue
				221	<computeroutput>vg_freed_list</computeroutput>, it is marked as
				222	inaccessible, and its shadow block now records the call stack at
				223	the time of the <computeroutput>free</computeroutput> call.
				224	Keeping <computeroutput>free</computeroutput>'d blocks in this
				225	queue allows Valgrind to spot all (presumably invalid) accesses
				226	to them. However, once the volume of blocks in the free queue
				227	exceeds <computeroutput>VG_(clo_freelist_vol)</computeroutput>,
				228	blocks are finally removed from the queue.</para>
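<para>The freed-blocks queue can be sketched as a FIFO with a volume
cap. This is a bookkeeping-only illustration, not the actual
implementation: the names <computeroutput>Shadow</computeroutput>,
<computeroutput>quarantine_block</computeroutput> and
<computeroutput>is_quarantined</computeroutput> are hypothetical,
client addresses are plain integers, and the stored call stacks are
omitted.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct Shadow {
    uintptr_t      addr;  /* start of the freed client block */
    size_t         size;  /* its size in bytes */
    struct Shadow *next;
} Shadow;

static Shadow *freed_head, *freed_tail;   /* oldest ... newest */
static size_t  freed_volume;              /* total bytes held in the queue */
static size_t  freelist_vol = 1000;       /* hypothetical default cap */

void   set_freelist_vol(size_t bytes) { freelist_vol = bytes; }
size_t quarantined_volume(void)       { return freed_volume; }

/* Called when the client frees a block: queue it, then release the
   oldest blocks until the queue is back under the volume cap. */
void quarantine_block(uintptr_t addr, size_t size)
{
    Shadow *s = malloc(sizeof *s);
    assert(s != NULL);
    s->addr = addr;
    s->size = size;
    s->next = NULL;
    if (freed_tail != NULL) freed_tail->next = s; else freed_head = s;
    freed_tail = s;
    freed_volume += size;

    while (freed_volume > freelist_vol && freed_head != NULL) {
        Shadow *old = freed_head;
        freed_head = old->next;
        if (freed_head == NULL) freed_tail = NULL;
        freed_volume -= old->size;
        free(old);  /* a real tool would hand the client block back here */
    }
}

/* True if addr falls inside a block that is still held in the queue. */
int is_quarantined(uintptr_t addr)
{
    for (const Shadow *s = freed_head; s != NULL; s = s->next)
        if (addr >= s->addr && addr - s->addr < s->size)
            return 1;
    return 0;
}
```
]]></programlisting>

<para>The cap is a trade-off: a larger queue catches accesses to
blocks freed longer ago, at the cost of holding on to more
memory.</para>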
				229
				230	<para>Keeping track of <literal>A</literal> and
				231	<literal>V</literal> bits (note: if you don't know what these
				232	are, you haven't read the user guide carefully enough) for memory
				233	is done in <filename>vg_memory.c</filename>. This implements a
				234	sparse array structure which covers the entire 4G address space
				235	in a way which is reasonably fast and reasonably space efficient.
				236	The 4G address space is divided up into 64K sections, each
covering 64KB of address space. Given a 32-bit address, the top
				238	16 bits are used to select one of the 65536 entries in
				239	<computeroutput>VG_(primary_map)</computeroutput>. The resulting
				240	"secondary" (<computeroutput>SecMap</computeroutput>) holds A and
V bits for the 64KB chunk of address space corresponding to the
				242	lower 16 bits of the address.</para>
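<para>A minimal C sketch of such a two-level map follows. It is
deliberately simplified and is not the code in
<filename>vg_memory.c</filename>: it tracks only one A bit per byte,
allocates secondaries lazily, and treats a missing secondary as "all
inaccessible". The function names are assumptions made for the
example.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SEC_SIZE 65536u   /* bytes of address space per secondary */

typedef struct {
    uint8_t abits[SEC_SIZE / 8];   /* one A bit per client byte */
} SecMap;

/* 65536 entries, selected by the top 16 bits of the address. */
static SecMap *primary_map[65536];

static SecMap *get_secmap(uint32_t addr, int create)
{
    uint32_t hi = addr >> 16;
    if (primary_map[hi] == NULL && create) {
        primary_map[hi] = calloc(1, sizeof(SecMap));
        assert(primary_map[hi] != NULL);
    }
    return primary_map[hi];
}

/* Mark one byte of client memory as accessible (ok != 0) or not. */
void set_accessible(uint32_t addr, int ok)
{
    SecMap  *sm = get_secmap(addr, 1);
    uint32_t lo = addr & 0xFFFFu;          /* lower 16 bits index the chunk */
    uint8_t  mask = (uint8_t)(1u << (lo & 7u));
    if (ok) sm->abits[lo >> 3] |= mask;
    else    sm->abits[lo >> 3] &= (uint8_t)~mask;
}

/* Query the A bit; addresses with no secondary are inaccessible. */
int is_accessible(uint32_t addr)
{
    const SecMap *sm = get_secmap(addr, 0);
    uint32_t lo = addr & 0xFFFFu;
    return sm != NULL && ((sm->abits[lo >> 3] >> (lo & 7u)) & 1u);
}
```
]]></programlisting>

<para>Only chunks that are actually touched cost any memory beyond
the primary table, which is what makes the structure sparse.</para>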
				243
				244	</sect2>
				245
				246
				247
				248	<sect2 id="mc-tech-docs.design" xreflabel="Design decisions">
				249	<title>Design decisions</title>
				250
				251	<para>Some design decisions were motivated by the need to make
				252	Valgrind debuggable. Imagine you are writing a CPU simulator.
				253	It works fairly well. However, you run some large program, like
				254	Netscape, and after tens of millions of instructions, it crashes.
				255	How can you figure out where in your simulator the bug is?</para>
				256
				257	<para>Valgrind's answer is: cheat. Valgrind is designed so that
				258	it is possible to switch back to running the client program on
				259	the real CPU at any point. Using the
				260	<computeroutput>--stop-after= </computeroutput> flag, you can ask
				261	Valgrind to run just some number of basic blocks, and then run
				262	the rest of the way on the real CPU. If you are searching for a
				263	bug in the simulated CPU, you can use this to do a binary search,
				264	which quickly leads you to the specific basic block which is
				265	causing the problem.</para>
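<para>The binary search itself is ordinary bisection over the block
count. The sketch below assumes a predicate that reports whether the
client still behaves correctly when the first
<computeroutput>n</computeroutput> basic blocks are simulated and the
rest run on the real CPU. In practice that predicate would be a
manual or scripted Valgrind run; here
<computeroutput>demo_run_ok</computeroutput> is a made-up stand-in,
and all names are hypothetical.</para>

<programlisting><![CDATA[
```c
#include <assert.h>

/* Returns nonzero if simulating the first n blocks gives a good run. */
typedef int (*RunFn)(unsigned long n_simulated);

/* Precondition: run_ok(good) succeeds and run_ok(bad) fails.
   Returns the smallest count that fails, so the block with that
   ordinal number is the first one the simulator gets wrong. */
unsigned long first_failing_count(unsigned long good, unsigned long bad,
                                  RunFn run_ok)
{
    assert(good < bad);
    while (bad - good > 1) {
        unsigned long mid = good + (bad - good) / 2;
        if (run_ok(mid))
            good = mid;
        else
            bad = mid;
    }
    return bad;
}

/* Made-up predicate for demonstration: block number 123457 is faulty. */
#define DEMO_BAD_BLOCK 123457UL
int demo_run_ok(unsigned long n_simulated)
{
    return n_simulated < DEMO_BAD_BLOCK;
}
```
]]></programlisting>

<para>With tens of millions of blocks, bisection needs only about
two dozen runs to isolate the culprit.</para>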
				266
				267	<para>This is all very handy. It does constrain the design in
				268	certain unimportant ways. Firstly, the layout of memory, when
				269	viewed from the client's point of view, must be identical
				270	regardless of whether it is running on the real or simulated CPU.
				271	This means that Valgrind can't do pointer swizzling -- well, no
				272	great loss -- and it can't run on the same stack as the client --
				273	again, no great loss. Valgrind operates on its own stack,
				274	<computeroutput>VG_(stack)</computeroutput>, which it switches to
				275	at startup, temporarily switching back to the client's stack when
				276	doing system calls for the client.</para>
				277
				278	<para>Valgrind also receives signals on its own stack,
				279	<computeroutput>VG_(sigstack)</computeroutput>, but for different
				280	gruesome reasons discussed below.</para>
				281
				282	<para>This nice clean
				283	switch-back-to-the-real-CPU-whenever-you-like story is muddied by
				284	signals. Problem is that signals arrive at arbitrary times and
				285	tend to slightly perturb the basic block count, with the result
				286	that you can get close to the basic block causing a problem but
				287	can't home in on it exactly. My kludgey hack is to define
				288	<computeroutput>SIGNAL_SIMULATION</computeroutput> to 1 towards
				289	the bottom of <filename>vg_syscall_mem.c</filename>, so that
				290	signal handlers are run on the real CPU and don't change the BB
				291	counts.</para>
				292
				293	<para>A second hole in the switch-back-to-real-CPU story is that
				294	Valgrind's way of delivering signals to the client is different
				295	from that of the kernel. Specifically, the layout of the signal
				296	delivery frame, and the mechanism used to detect a sighandler
				297	returning, are different. So you can't expect to make the
				298	transition inside a sighandler and still have things working, but
				299	in practice that's not much of a restriction.</para>
				300
				301	<para>Valgrind's implementation of
				302	<computeroutput>malloc</computeroutput>,
				303	<computeroutput>free</computeroutput>, etc, (in
				304	<filename>vg_clientmalloc.c</filename>, not the low-level stuff
				305	in <filename>vg_malloc2.c</filename>) is somewhat complicated by
				306	the need to handle switching back at arbitrary points. It does
work though.</para>
				308
				309	</sect2>
				310
				311
				312
				313	<sect2 id="mc-tech-docs.correctness" xreflabel="Correctness">
				314	<title>Correctness</title>
				315
				316	<para>There's only one of me, and I have a Real Life (tm) as well
				317	as hacking Valgrind [allegedly :-]. That means I don't have time
				318	to waste chasing endless bugs in Valgrind. My emphasis is
				319	therefore on doing everything as simply as possible, with
				320	correctness, stability and robustness being the number one
				321	priority, more important than performance or functionality. As a
				322	result:</para>
				323
				324	<itemizedlist>
				325
				326	<listitem>
				327	<para>The code is absolutely loaded with assertions, and
				328	these are <command>permanently enabled.</command> I have no
				329	plan to remove or disable them later. Over the past couple
				330	of months, as valgrind has become more widely used, they have
				331	shown their worth, pulling up various bugs which would
				332	otherwise have appeared as hard-to-find segmentation
				333	faults.</para>
				334
				335	<para>I am of the view that it's acceptable to spend 5% of
				336	the total running time of your valgrindified program doing
				337	assertion checks and other internal sanity checks.</para>
				338	</listitem>
				339
				340	<listitem>
				341	<para>Aside from the assertions, valgrind contains various
				342	sets of internal sanity checks, which get run at varying
				343	frequencies during normal operation.
				344	<computeroutput>VG_(do_sanity_checks)</computeroutput> runs
				345	every 1000 basic blocks, which means 500 to 2000 times/second
				346	for typical machines at present. It checks that Valgrind
				347	hasn't overrun its private stack, and does some simple checks
				348	on the memory permissions maps. Once every 25 calls it does
				349	some more extensive checks on those maps. Etc, etc.</para>
				350	<para>The following components also have sanity check code,
				351	which can be enabled to aid debugging:</para>
				352	<itemizedlist>
				353	<listitem><para>The low-level memory-manager
				354	(<computeroutput>VG_(mallocSanityCheckArena)</computeroutput>).
				355	This does a complete check of all blocks and chains in an
arena, which is very slow. It is not engaged by default.</para>
				357	</listitem>
				358
				359	<listitem>
				360	<para>The symbol table reader(s): various checks to
				361	ensure uniqueness of mappings; see
				362	<computeroutput>VG_(read_symbols)</computeroutput> for a
start. It is permanently engaged.</para>
				364	</listitem>
				365
				366	<listitem>
				367	<para>The A and V bit tracking stuff in
				368	<filename>vg_memory.c</filename>. This can be compiled
				369	with cpp symbol
				370	<computeroutput>VG_DEBUG_MEMORY</computeroutput> defined,
				371	which removes all the fast, optimised cases, and uses
				372	simple-but-slow fallbacks instead. Not engaged by
				373	default.</para>
				374	</listitem>
				375
				376	<listitem>
				377	<para>Ditto
				378	<computeroutput>VG_DEBUG_LEAKCHECK</computeroutput>.</para>
				379	</listitem>
				380
				381	<listitem>
				382	<para>The JITter parses x86 basic blocks into sequences
				383	of UCode instructions. It then sanity checks each one
				384	with <computeroutput>VG_(saneUInstr)</computeroutput> and
				385	sanity checks the sequence as a whole with
				386	<computeroutput>VG_(saneUCodeBlock)</computeroutput>.
				387	This stuff is engaged by default, and has caught some
				388	way-obscure bugs in the simulated CPU machinery in its
				389	time.</para>
				390	</listitem>
				391
				392	<listitem>
				393	<para>The system call wrapper does
				394	<computeroutput>VG_(first_and_last_secondaries_look_plausible)</computeroutput>
				395	after every syscall; this is known to pick up bugs in the
				396	syscall wrappers. Engaged by default.</para>
				397	</listitem>
				398
				399	<listitem>
				400	<para>The main dispatch loop, in
				401	<computeroutput>VG_(dispatch)</computeroutput>, checks
				402	that translations do not set
				403	<computeroutput>%ebp</computeroutput> to any value
				404	different from
				405	<computeroutput>VG_EBP_DISPATCH_CHECKED</computeroutput>
or <computeroutput>&amp; VG_(baseBlock)</computeroutput>.
				407	In effect this test is free, and is permanently
				408	engaged.</para>
				409	</listitem>
				410
				411	<listitem>
				412	<para>There are a couple of ifdefed-out consistency
				413	checks I inserted whilst debugging the new register
allocator,
				415	<computeroutput>vg_do_register_allocation</computeroutput>.</para>
				416	</listitem>
				417	</itemizedlist>
				418	</listitem>
				419
				420	<listitem>
				421	<para>I try to avoid techniques, algorithms, mechanisms, etc,
				422	for which I can supply neither a convincing argument that
				423	they are correct, nor sanity-check code which might pick up
				424	bugs in my implementation. I don't always succeed in this,
				425	but I try. Basically the idea is: avoid techniques which
				426	are, in practice, unverifiable, in some sense. When doing
				427	anything, always have in mind: "how can I verify that this is
				428	correct?"</para>
				429	</listitem>
				430
				431	</itemizedlist>
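<para>The two-tier schedule described above for
<computeroutput>VG_(do_sanity_checks)</computeroutput> (cheap checks
every 1000 basic blocks, more extensive ones on every 25th call) can
be sketched with two counters. The checks themselves are stubs here,
and all names in the sketch are hypothetical.</para>

<programlisting><![CDATA[
```c
#include <assert.h>

#define CHEAP_PERIOD_BBS 1000u  /* cheap checks every 1000 basic blocks */
#define EXPENSIVE_PERIOD 25u    /* extensive checks every 25th cheap call */

static unsigned long bbs_done, cheap_runs, expensive_runs;

/* Stubs: real checks would inspect the stack and the permission maps. */
static void cheap_checks(void)     { cheap_runs++; }
static void expensive_checks(void) { expensive_runs++; }

/* Called once per executed basic block. */
void basic_block_done(void)
{
    bbs_done++;
    if (bbs_done % CHEAP_PERIOD_BBS != 0)
        return;
    cheap_checks();
    if (cheap_runs % EXPENSIVE_PERIOD == 0)
        expensive_checks();
    assert(expensive_runs <= cheap_runs);
}

unsigned long cheap_checks_run(void)     { return cheap_runs; }
unsigned long expensive_checks_run(void) { return expensive_runs; }
```
]]></programlisting>

<para>Tiering like this keeps the common path to a counter increment
and a test, while still running the slow checks regularly.</para>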
				432
				433
				434	<para>Some more specific things are:</para>
				435	<itemizedlist>
				436	<listitem>
				437	<para>Valgrind runs in the same namespace as the client, at
				438	least from <filename>ld.so</filename>'s point of view, and it
				439	therefore absolutely had better not export any symbol with a
				440	name which could clash with that of the client or any of its
				441	libraries. Therefore, all globally visible symbols exported
				442	from <filename>valgrind.so</filename> are defined using the
				443	<computeroutput>VG_</computeroutput> CPP macro. As you'll
see from <filename>vg_constants.h</filename>, this prepends
				445	some arbitrary prefix to the symbol, in order that it be, we
				446	hope, globally unique. Currently the prefix is
				447	<computeroutput>vgPlain_</computeroutput>. For convenience
				448	there are also <computeroutput>VGM_</computeroutput>,
				449	<computeroutput>VGP_</computeroutput> and
				450	<computeroutput>VGOFF_</computeroutput>. All locally defined
				451	symbols are declared <computeroutput>static</computeroutput>
				452	and do not appear in the final shared object.</para>
				453
				454	<para>To check this, I periodically do <computeroutput>nm
valgrind.so | grep " T "</computeroutput>, which shows you
				456	all the globally exported text symbols. They should all have
				457	an approved prefix, except for those like
				458	<computeroutput>malloc</computeroutput>,
				459	<computeroutput>free</computeroutput>, etc, which we
				460	deliberately want to shadow and take precedence over the same
				461	names exported from <filename>glibc.so</filename>, so that
				462	valgrind can intercept those calls easily. Similarly,
<computeroutput>nm valgrind.so | grep " D "</computeroutput>
				464	allows you to find any rogue data-segment symbol
				465	names.</para>
				466	</listitem>
				467
				468	<listitem>
				469	<para>Valgrind tries, and almost succeeds, in being
				470	completely independent of all other shared objects, in
				471	particular of <filename>glibc.so</filename>. For example, we
				472	have our own low-level memory manager in
				473	<filename>vg_malloc2.c</filename>, which is a fairly standard
				474	malloc/free scheme augmented with arenas, and
				475	<filename>vg_mylibc.c</filename> exports reimplementations of
				476	various bits and pieces you'd normally get from the C
				477	library.</para>
				478
				479	<para>Why all the hassle? Because imagine the potential
				480	chaos of both the simulated and real CPUs executing in
				481	<filename>glibc.so</filename>. It just seems simpler and
				482	cleaner to be completely self-contained, so that only the
				483	simulated CPU visits <filename>glibc.so</filename>. In
				484	practice it's not much hassle anyway. Also, valgrind starts
				485	up before glibc has a chance to initialise itself, and who
				486	knows what difficulties that could lead to. Finally, glibc
				487	has definitions for some types, specifically
<computeroutput>sigset_t</computeroutput>, which conflict with
				489	(are different from) the Linux kernel's idea of same. When
				490	Valgrind wants to fiddle around with signal stuff, it wants
				491	to use the kernel's definitions, not glibc's definitions. So
				492	it's simplest just to keep glibc out of the picture
				493	entirely.</para>
				494
				495	<para>To find out which glibc symbols are used by Valgrind,
				496	reinstate the link flags <computeroutput>-nostdlib
				497	-Wl,-no-undefined</computeroutput>. This causes linking to
				498	fail, but will tell you what you depend on. I have mostly,
				499	but not entirely, got rid of the glibc dependencies; what
				500	remains is, IMO, fairly harmless. AFAIK the current
				501	dependencies are: <computeroutput>memset</computeroutput>,
				502	<computeroutput>memcmp</computeroutput>,
				503	<computeroutput>stat</computeroutput>,
				504	<computeroutput>system</computeroutput>,
				505	<computeroutput>sbrk</computeroutput>,
				506	<computeroutput>setjmp</computeroutput> and
				507	<computeroutput>longjmp</computeroutput>.</para>
				508	</listitem>
				509
				510	<listitem>
				511	<para>Similarly, valgrind should not really import any
				512	headers other than the Linux kernel headers, since it knows
				513	of no API other than the kernel interface to talk to. At the
				514	moment this is really not in a good state, and
				515	<computeroutput>vg_syscall_mem</computeroutput> imports, via
				516	<filename>vg_unsafe.h</filename>, a significant number of
				517	C-library headers so as to know the sizes of various structs
				518	passed across the kernel boundary. This is of course
				519	completely bogus, since there is no guarantee that the C
library's definitions of these structs match those of the
				521	kernel. I have started to sort this out using
				522	<filename>vg_kerneliface.h</filename>, into which I had
				523	intended to copy all kernel definitions which valgrind could
				524	need, but this has not gotten very far. At the moment it
				525	mostly contains definitions for
				526	<computeroutput>sigset_t</computeroutput> and
				527	<computeroutput>struct sigaction</computeroutput>, since the
				528	kernel's definition for these really does clash with glibc's.
				529	I plan to use a <computeroutput>vki_</computeroutput> prefix
				530	on all these types and constants, to denote the fact that
				531	they pertain to <command>V</command>algrind's
				532	<command>K</command>ernel
				533	<command>I</command>nterface.</para>
				534
				535	<para>Another advantage of having a
				536	<filename>vg_kerneliface.h</filename> file is that it makes
it simpler to interface to a different kernel. One can, for
				538	example, easily imagine writing a new
				539	<filename>vg_kerneliface.h</filename> for FreeBSD, or x86
				540	NetBSD.</para>
				541	</listitem>
				542
				543	</itemizedlist>
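<para>The symbol-prefixing scheme mentioned in the first item above
is a standard use of preprocessor token pasting. The sketch below
shows one plausible way to define such a macro, using the
<computeroutput>vgPlain_</computeroutput> prefix named in the text.
The exact definition in <filename>vg_constants.h</filename> may
differ, and the function <computeroutput>answer</computeroutput> and
the stringifying helpers are invented for the example.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <string.h>

/* Paste a fixed prefix onto every exported name. */
#define VG_(name) vgPlain_##name

/* Helpers to view the expanded name as a string (two levels are
   needed so that the argument is macro-expanded before stringifying). */
#define STR2(x) #x
#define STR(x)  STR2(x)

/* Written as VG_(answer) in the source, exported as vgPlain_answer. */
int VG_(answer)(void)
{
    return 42;
}

const char *mangled_name(void)
{
    return STR(VG_(answer));
}
```
]]></programlisting>

<para>Because every use goes through the macro, changing the prefix
is a one-line edit, and a search of the symbol table for names
without the prefix finds any that escaped.</para>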
				544
				545	</sect2>
				546
				547
				548
				549	<sect2 id="mc-tech-docs.limits" xreflabel="Current limitations">
				550	<title>Current limitations</title>
				551
				552	<para>Support for weird (non-POSIX) signal stuff is patchy. Does
				553	anybody care?</para>
				554
				555	</sect2>
				556
				557	</sect1>
				558
				559
				560
				561
				562
				563	<sect1 id="mc-tech-docs.jitter" xreflabel="The instrumenting JITter">
				564	<title>The instrumenting JITter</title>
				565
				566	<para>This really is the heart of the matter. We begin with
				567	various side issues.</para>
				568
				569
				570	<sect2 id="mc-tech-docs.storage"
				571	xreflabel="Run-time storage, and the use of host registers">
				572	<title>Run-time storage, and the use of host registers</title>
				573
				574	<para>Valgrind translates client (original) basic blocks into
				575	instrumented basic blocks, which live in the translation cache
				576	TC, until either the client finishes or the translations are
				577	ejected from TC to make room for newer ones.</para>
				578
				579	<para>Since it generates x86 code in memory, Valgrind has
				580	complete control of the use of registers in the translations.
				581	Now pay attention. I shall say this only once, and it is
				582	important you understand this. In what follows I will refer to
registers in the host (real) CPU using their standard names,
				584	<computeroutput>%eax</computeroutput>,
				585	<computeroutput>%edi</computeroutput>, etc. I refer to registers
				586	in the simulated CPU by capitalising them:
				587	<computeroutput>%EAX</computeroutput>,
				588	<computeroutput>%EDI</computeroutput>, etc. These two sets of
				589	registers usually bear no direct relationship to each other;
				590	there is no fixed mapping between them. This naming scheme is
				591	used fairly consistently in the comments in the sources.</para>
				592
				593	<para>Host registers, once things are up and running, are used as
				594	follows:</para>
				595
				596	<itemizedlist>
				597	<listitem>
				598	<para><computeroutput>%esp</computeroutput>, the real stack
				599	pointer, points somewhere in Valgrind's private stack area,
				600	<computeroutput>VG_(stack)</computeroutput> or, transiently,
				601	into its signal delivery stack,
				602	<computeroutput>VG_(sigstack)</computeroutput>.</para>
				603	</listitem>
				604
				605	<listitem>
				606	<para><computeroutput>%edi</computeroutput> is used as a
				607	temporary in code generation; it is almost always dead,
				608	except when used for the
				609	<computeroutput>Left</computeroutput> value-tag operations.</para>
				610	</listitem>
				611
				612	<listitem>
				613	<para><computeroutput>%eax</computeroutput>,
				614	<computeroutput>%ebx</computeroutput>,
				615	<computeroutput>%ecx</computeroutput>,
				616	<computeroutput>%edx</computeroutput> and
				617	<computeroutput>%esi</computeroutput> are available to
				618	Valgrind's register allocator. They are dead (carry
				619	unimportant values) in between translations, and are live
				620	only in translations. The one exception to this is
				621	<computeroutput>%eax</computeroutput>, which, as mentioned
				622	far above, has a special significance to the dispatch loop
				623	<computeroutput>VG_(dispatch)</computeroutput>: when a
				624	translation returns to the dispatch loop,
				625	<computeroutput>%eax</computeroutput> is expected to contain
				626	the original-code-address of the next translation to run.
				627	The register allocator is so good at minimising spill code
				628	that using five regs and not having to save/restore
				629	<computeroutput>%edi</computeroutput> actually gives better
				630	code than allocating to <computeroutput>%edi</computeroutput>
				631	as well, but then having to push/pop it around special
				632	uses.</para>
				633	</listitem>
				634
				635	<listitem>
				636	<para><computeroutput>%ebp</computeroutput> points
				637	permanently at
				638	<computeroutput>VG_(baseBlock)</computeroutput>. Valgrind's
				639	translations are position-independent, partly because this is
				640	convenient, but also because translations get moved around in
				641	TC as part of the LRUing activity. <command>All</command>
				642	static entities which need to be referred to from generated
				643	code, whether data or helper functions, are stored starting
				644	at <computeroutput>VG_(baseBlock)</computeroutput> and are
				645	therefore reached by indexing from
				646	<computeroutput>%ebp</computeroutput>. There is but one
				647	exception, which is that by placing the value
				648	<computeroutput>VG_EBP_DISPATCH_CHECKED</computeroutput> in
				649	<computeroutput>%ebp</computeroutput> just before a return to
				650	the dispatcher, the dispatcher is informed that the next
				651	address to run, in <computeroutput>%eax</computeroutput>,
				652	requires special treatment.</para>
				653	</listitem>
				654
				655	<listitem>
				656	<para>The real machine's FPU state is pretty much
				657	unimportant, for reasons which will become obvious. Ditto
				658	its <computeroutput>%eflags</computeroutput> register.</para>
				659	</listitem>
				660
				661	</itemizedlist>
				662
				663	<para>The state of the simulated CPU is stored in memory, in
				664	<computeroutput>VG_(baseBlock)</computeroutput>, which is a block
				665	of 200 words IIRC. Recall that
				666	<computeroutput>%ebp</computeroutput> points permanently at the
				667	start of this block. Function
				668	<computeroutput>vg_init_baseBlock</computeroutput> decides what
				669	the offsets of various entities in
				670	<computeroutput>VG_(baseBlock)</computeroutput> are to be, and
				671	allocates word offsets for them. The code generator then emits
				672	<computeroutput>%ebp</computeroutput> relative addresses to get
				673	at those things. The sequence in which entities are allocated
				674	has been carefully chosen so that the 32 most popular entities
				675	come first, because this means 8-bit offsets can be used in the
				676	generated code.</para>
				677
				678	<para>If I was clever, I could make
				679	<computeroutput>%ebp</computeroutput> point 32 words along
				680	<computeroutput>VG_(baseBlock)</computeroutput>, so that I'd have
				681	another 32 words of short-form offsets available, but that's just
				682	complicated, and it's not important -- the first 32 words take
				683	99% (or whatever) of the traffic.</para>
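<para>The arithmetic behind the 32-word figure can be sketched as
follows. This is a toy Python model, not Valgrind code. It assumes
4-byte words and the signed 8-bit displacement range of x86 addressing
modes (-128 to 127), and it takes the 200-word block size from the
rough figure quoted above. The names
<computeroutput>fits_disp8</computeroutput> and
<computeroutput>short_form_words</computeroutput> are invented for
illustration.</para>

```python
WORD_BYTES = 4  # assumed x86-32 word size


def fits_disp8(byte_offset):
    """True if byte_offset fits a signed 8-bit x86 displacement."""
    return byte_offset in range(-128, 128)


def short_form_words(base_word, total_words=200):
    """Count the block's words reachable with an 8-bit offset when the
    base register points at word index base_word."""
    return sum(
        1
        for w in range(total_words)
        if fits_disp8((w - base_word) * WORD_BYTES)
    )


print(short_form_words(0))   # base register at the start of the block
print(short_form_words(32))  # base register moved 32 words along
```

<para>With the base at word 0 the sketch counts 32 short-form words.
With the base moved 32 words along it counts 64, because negative
displacements become usable as well.</para>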
				684
				685	<para>Currently, the sequence of stuff in
				686	<computeroutput>VG_(baseBlock)</computeroutput> is as
				687	follows:</para>
				688
				689	<itemizedlist>
				690	<listitem>
				691	<para>9 words, holding the simulated integer registers,
				692	<computeroutput>%EAX</computeroutput>
				693	.. <computeroutput>%EDI</computeroutput>, and the simulated
				694	flags, <computeroutput>%EFLAGS</computeroutput>.</para>
				695	</listitem>
				696
				697	<listitem>
				698	<para>Another 9 words, holding the V bit "shadows" for the
				699	above 9 regs.</para>
				700	</listitem>
				701
				702	<listitem>
				703	<para>The <command>addresses</command> of various helper
				704	routines called from generated code:
				705	<computeroutput>VG_(helper_value_check4_fail)</computeroutput>,
				706	<computeroutput>VG_(helper_value_check0_fail)</computeroutput>,
				707	which register V-check failures,
				708	<computeroutput>VG_(helperc_STOREV4)</computeroutput>,
				709	<computeroutput>VG_(helperc_STOREV1)</computeroutput>,
				710	<computeroutput>VG_(helperc_LOADV4)</computeroutput>,
				711	<computeroutput>VG_(helperc_LOADV1)</computeroutput>, which
				712	do stores and loads of V bits to/from the sparse array which
				713	keeps track of V bits in memory, and
				714	<computeroutput>VGM_(handle_esp_assignment)</computeroutput>,
				715	which messes with memory addressability resulting from
				716	changes in <computeroutput>%ESP</computeroutput>.</para>
				717	</listitem>
				718
				719	<listitem>
				720	<para>The simulated <computeroutput>%EIP</computeroutput>.</para>
				721	</listitem>
				722
				723	<listitem>
				724	<para>24 spill words, for when the register allocator can't
				725	make it work with 5 measly registers.</para>
				726	</listitem>
				727
				728	<listitem>
				729	<para>Addresses of helpers
				730	<computeroutput>VG_(helperc_STOREV2)</computeroutput>,
				731	<computeroutput>VG_(helperc_LOADV2)</computeroutput>. These
				732	are here because 2-byte loads and stores are relatively rare,
				733	so are placed above the magic 32-word offset boundary.</para>
				734	</listitem>
				735
				736	<listitem>
				737	<para>For similar reasons, addresses of helper functions
				738	<computeroutput>VGM_(fpu_write_check)</computeroutput> and
				739	<computeroutput>VGM_(fpu_read_check)</computeroutput>, which
				740	handle the A/V maps testing and changes required by FPU
				741	writes/reads.</para>
				742	</listitem>
				743
				744	<listitem>
				745	<para>Some other boring helper addresses:
				746	<computeroutput>VG_(helper_value_check2_fail)</computeroutput>
				747	and
				748	<computeroutput>VG_(helper_value_check1_fail)</computeroutput>.
				749	These are probably never emitted now, and should be
				750	removed.</para>
				751	</listitem>
				752
				753	<listitem>
				754	<para>The entire state of the simulated FPU, which I believe
				755	to be 108 bytes long.</para>
				756	</listitem>
				757
				758	<listitem>
				759	<para>Finally, the addresses of various other helper
				760	functions in <filename>vg_helpers.S</filename>, which deal
				761	with rare situations which are tedious or difficult to
				762	generate code in-line for.</para>
				763	</listitem>
				764
				765	</itemizedlist>
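<para>To see where the first few entries of the list above would
land, here is a small Python sketch that assigns consecutive word
offsets. It is only an illustration under stated assumptions: each
helper address is taken to occupy one word, the seven helpers named
in the third item are taken to be the complete set, and only the
first five items are modelled. The real offsets are whatever
<computeroutput>vg_init_baseBlock</computeroutput> decides.</para>

```python
# (name, size in words), in the order given by the list above.
LAYOUT = [
    ("integer registers and %EFLAGS", 9),
    ("V-bit shadows of the above", 9),
    ("addresses of 7 frequently used helpers", 7),  # assumed: one word each
    ("%EIP", 1),
    ("spill words", 24),
]


def word_offsets(layout):
    """Return {name: (first_word, last_word)} for consecutive entries."""
    offsets, next_word = {}, 0
    for name, size in layout:
        offsets[name] = (next_word, next_word + size - 1)
        next_word += size
    return offsets


for name, (first, last) in word_offsets(LAYOUT).items():
    reach = "short-form" if 31 >= last else "not wholly short-form"
    print(f"{first:3d}..{last:3d}  {name}  ({reach})")
```

<para>If those assumptions hold, the simulated
<computeroutput>%EIP</computeroutput> sits at word 25 and the spill
words occupy words 26 to 49, so only the first six spill words would
fall inside the 32-word short-form range.</para>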
				766
				767	<para>As a general rule, the simulated machine's state lives
				768	permanently in memory at
				769	<computeroutput>VG_(baseBlock)</computeroutput>. However, the
				770	JITter does some optimisations which allow the simulated integer
				771	registers to be cached in real registers over multiple simulated
				772	instructions within the same basic block. These are always
				773	flushed back into memory at the end of every basic block, so that
				774	the in-memory state is up-to-date between basic blocks. (This
				775	flushing is implied by the statement above that the real
				776	machine's allocatable registers are dead in between simulated
				777	blocks).</para>
				778
				779	</sect2>
				780
				781
				782
				783	<sect2 id="mc-tech-docs.startup"
				784	xreflabel="Startup, shutdown, and system calls">
				785	<title>Startup, shutdown, and system calls</title>
				786
				787	<para>Getting into Valgrind
				788	(<computeroutput>VG_(startup)</computeroutput>, called from
				789	<filename>valgrind.so</filename>'s initialisation section),
				790	really means copying the real CPU's state into
				791	<computeroutput>VG_(baseBlock)</computeroutput>, and then
				792	installing our own stack pointer, etc, into the real CPU, and
				793	then starting up the JITter. Exiting Valgrind involves copying
				794	the simulated state back to the real state.</para>
				795
				796	<para>Unfortunately, there's a complication at startup time.
				797	The problem is that at the point where we need to take a snapshot of
				798	the real CPU's state, the offsets in
				799	<computeroutput>VG_(baseBlock)</computeroutput> are not set up
				800	yet, because to do so would involve disrupting the real machine's
				801	state significantly. The way round this is to dump the real
				802	machine's state into a temporary, static block of memory,
				803	<computeroutput>VG_(m_state_static)</computeroutput>. We can
				804	then set up the <computeroutput>VG_(baseBlock)</computeroutput>
				805	offsets at our leisure, and copy into it from
				806	<computeroutput>VG_(m_state_static)</computeroutput> at some
				807	convenient later time. This copying is done by
				808	<computeroutput>VG_(copy_m_state_static_to_baseBlock)</computeroutput>.</para>
				809
				810	<para>On exit, the inverse transformation is (rather
				811	unnecessarily) used: stuff in
				812	<computeroutput>VG_(baseBlock)</computeroutput> is copied to
				813	<computeroutput>VG_(m_state_static)</computeroutput>, and the
				814	assembly stub then copies from
				815	<computeroutput>VG_(m_state_static)</computeroutput> into the
				816	real machine registers.</para>
				817
				818	<para>Doing system calls on behalf of the client
				819	(<filename>vg_syscall.S</filename>) is something of a half-way
				820	house. We have to make the world look sufficiently like that
				821	which the client would normally have to make the syscall actually
				822	work properly, but we can't afford to lose control. So the trick
				823	is to copy all of the client's state, <command>except its program
				824	counter</command>, into the real CPU, do the system call, and
				825	copy the state back out. Note that the client's state includes
				826	its stack pointer register, so one effect of this partial
				827	restoration is to cause the system call to be run on the client's
				828	stack, as it should be.</para>
				829
				830	<para>As ever there are complications. We have to save some of
				831	our own state somewhere when restoring the client's state into
				832	the CPU, so that we can keep going sensibly afterwards. In fact
				833	the only thing which is important is our own stack pointer, but
				834	for paranoia reasons I save and restore our own FPU state as
				835	well, even though that's probably pointless.</para>
				836
				837	<para>The complication on top of the above complication is that, for
				838	horrible reasons to do with signals, we may have to handle a
				839	second client system call whilst the client is blocked inside
				840	some other system call (unbelievable!). That means there are two
				841	sets of places to dump Valgrind's stack pointer and FPU state
				842	across the syscall, and we decide which to use by consulting
				843	<computeroutput>VG_(syscall_depth)</computeroutput>, which is in
				844	turn maintained by
				845	<computeroutput>VG_(wrap_syscall)</computeroutput>.</para>
				846
				847	</sect2>
				848
				849
				850
				851	<sect2 id="mc-tech-docs.ucode" xreflabel="Introduction to UCode">
				852	<title>Introduction to UCode</title>
				853
				854	<para>UCode lies at the heart of the x86-to-x86 JITter. The
				855	basic premise is that dealing with the x86 instruction set head-on
				856	is just too darn complicated, so we do the traditional
				857	compiler-writer's trick and translate it into a simpler,
				858	easier-to-deal-with form.</para>
				859
				860	<para>In normal operation, translation proceeds through six
				861	stages, coordinated by
				862	<computeroutput>VG_(translate)</computeroutput>:</para>
				863
				864	<orderedlist>
				865	<listitem>
				866	<para>Parsing of an x86 basic block into a sequence of UCode
				867	instructions (<computeroutput>VG_(disBB)</computeroutput>).</para>
				868	</listitem>
				869
				870	<listitem>
				871	<para>UCode optimisation
				872	(<computeroutput>vg_improve</computeroutput>), with the aim
				873	of caching simulated registers in real registers over
				874	multiple simulated instructions, and removing redundant
				875	simulated <computeroutput>%EFLAGS</computeroutput>
				876	saving/restoring.</para>
				877	</listitem>
				878
				879	<listitem>
				880	<para>UCode instrumentation
				881	(<computeroutput>vg_instrument</computeroutput>), which adds
				882	value and address checking code.</para>
				883	</listitem>
				884
				885	<listitem>
				886	<para>Post-instrumentation cleanup
				887	(<computeroutput>vg_cleanup</computeroutput>), removing
				888	redundant value-check computations.</para>
				889	</listitem>
				890
				891	<listitem>
				892	<para>Register allocation
				893	(<computeroutput>vg_do_register_allocation</computeroutput>),
				894	which, note, is done on UCode.</para>
				895	</listitem>
				896
				897	<listitem>
				898	<para>Emission of final instrumented x86 code
				899	(<computeroutput>VG_(emit_code)</computeroutput>).</para>
				900	</listitem>
				901
				902	</orderedlist>
				903
				904	<para>Notice how steps 2, 3, 4 and 5 are simple UCode-to-UCode
				905	transformation passes, all on straight-line blocks of UCode (type
				906	<computeroutput>UCodeBlock</computeroutput>). Steps 2 and 4 are
				907	optimisation passes and can be disabled for debugging purposes,
				908	with <computeroutput>--optimise=no</computeroutput> and
				909	<computeroutput>--cleanup=no</computeroutput> respectively.</para>
				910
				911	<para>Valgrind can also run in a no-instrumentation mode, given
				912	<computeroutput>--instrument=no</computeroutput>. This is useful
				913	for debugging the JITter quickly without having to deal with the
				914	complexity of the instrumentation mechanism too. In this mode,
				915	steps 3 and 4 are omitted.</para>
				916
				917	<para>These flags combine, so that
				918	<computeroutput>--instrument=no</computeroutput> together with
				919	<computeroutput>--optimise=no</computeroutput> means only steps
				920	1, 5 and 6 are used.
				921	<computeroutput>--single-step=yes</computeroutput> causes each
				922	x86 instruction to be treated as a single basic block. The
				923	translations are terrible but this is sometimes instructive.</para>
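<para>The way these flags select stages can be summarised in a few
lines of Python. This is a sketch of the rules stated above, not code
from <computeroutput>VG_(translate)</computeroutput>. The function
name and its keyword arguments are made up to mirror the
command-line flags.</para>

```python
STAGES = {
    1: "x86 basic block to UCode",
    2: "UCode optimisation",
    3: "UCode instrumentation",
    4: "post-instrumentation cleanup",
    5: "register allocation",
    6: "x86 code emission",
}


def active_stages(optimise=True, cleanup=True, instrument=True):
    """Return the stage numbers that run for a given flag combination."""
    skipped = set()
    if not optimise:
        skipped.add(2)          # --optimise=no
    if not cleanup:
        skipped.add(4)          # --cleanup=no
    if not instrument:
        skipped.update({3, 4})  # --instrument=no drops both 3 and 4
    return [n for n in sorted(STAGES) if n not in skipped]


print(active_stages())
print(active_stages(instrument=False, optimise=False))
```

<para>The second call yields stages 1, 5 and 6, matching the
combination described above.</para>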
				924
				925	<para>The <computeroutput>--stop-after=N</computeroutput> flag
				926	switches back to the real CPU after
				927	<computeroutput>N</computeroutput> basic blocks. It also re-JITs
				928	the final basic block executed and prints the debugging info
				929	resulting, so this gives you a way to get a quick snapshot of how
				930	a basic block looks as it passes through the six stages mentioned
				931	above. If you want to see full information for every block
				932	translated (probably not, but still ...) find, in
				933	<computeroutput>VG_(translate)</computeroutput>, the lines</para>
				934	<programlisting><![CDATA[
				935	dis = True;
				936	dis = debugging_translation;]]></programlisting>
				937
				938	<para>and comment out the second line. This will spew out
				939	debugging junk faster than you can possibly imagine.</para>
				940
				941	</sect2>
				942
				943
				944
				945	<sect2 id="mc-tech-docs.tags" xreflabel="UCode operand tags: type 'Tag'">
				946	<title>UCode operand tags: type <computeroutput>Tag</computeroutput></title>
				947
				948	<para>UCode is, more or less, a simple two-address RISC-like
				949	code. In keeping with the x86 AT&amp;T assembly syntax,
				950	generally speaking the first operand is the source operand, and
				951	the second is the destination operand, which is modified when the
				952	uinstr is notionally executed.</para>
				953
				954	<para>UCode instructions have up to three operand fields, each of
				955	which has a corresponding <computeroutput>Tag</computeroutput>
				956	describing it. Possible values for the tag are:</para>
				957
				958	<itemizedlist>
				959
				960	<listitem>
				961	<para><computeroutput>NoValue</computeroutput>: indicates
				962	that the field is not in use.</para>
				963	</listitem>
				964
				965	<listitem>
				966	<para><computeroutput>Lit16</computeroutput>: the field
				967	contains a 16-bit literal.</para>
				968	</listitem>
				969
				970	<listitem>
				971	<para><computeroutput>Literal</computeroutput>: the field
				972	denotes a 32-bit literal, whose value is stored in the
				973	<computeroutput>lit32</computeroutput> field of the uinstr
				974	itself. Since there is only one
				975	<computeroutput>lit32</computeroutput> for the whole uinstr,
				976	only one operand field may contain this tag.</para>
				977	</listitem>
				978
				979	<listitem>
				980	<para><computeroutput>SpillNo</computeroutput>: the field
				981	contains a spill slot number, in the range 0 to 23 inclusive,
				982	denoting one of the spill slots contained inside
				983	<computeroutput>VG_(baseBlock)</computeroutput>. Such tags
				984	only exist after register allocation.</para>
				985	</listitem>
				986
				987	<listitem>
				988	<para><computeroutput>RealReg</computeroutput>: the field
				989	contains a number in the range 0 to 7 denoting an integer x86
				990	("real") register on the host. The number is the Intel
				991	encoding for integer registers. Such tags only exist after
				992	register allocation.</para>
				993	</listitem>
				994
				995	<listitem>
				996	<para><computeroutput>ArchReg</computeroutput>: the field
				997	contains a number in the range 0 to 7 denoting an integer x86
				998	register on the simulated CPU. In reality this means a
				999	reference to one of the first 8 words of
				1000	<computeroutput>VG_(baseBlock)</computeroutput>. Such tags
				1001	can exist at any point in the translation process.</para>
				1002	</listitem>
				1003
				1004	<listitem>
				1005	<para>Last, but not least,
				1006	<computeroutput>TempReg</computeroutput>. The field contains
				1007	the number of one of an infinite set of virtual (integer)
				1008	registers. <computeroutput>TempReg</computeroutput>s are used
				1009	everywhere throughout the translation process; you can have
				1010	as many as you want. The register allocator maps as many as
				1011	it can into <computeroutput>RealReg</computeroutput>s and
				1012	turns the rest into
				1013	<computeroutput>SpillNo</computeroutput>s, so
				1014	<computeroutput>TempReg</computeroutput>s should not exist
				1015	after the register allocation phase.</para>
				1016
				1017	<para><computeroutput>TempReg</computeroutput>s are always 32
				1018	bits long, even if the data they hold is logically shorter.
				1019	In that case the upper unused bits are required, and, I
				1020	think, generally assumed, to be zero.
				1021	<computeroutput>TempReg</computeroutput>s holding V bits for
				1022	quantities shorter than 32 bits are expected to have ones in
				1023	the unused places, since a one denotes "undefined".</para>
				1024	</listitem>
				1025
				1026	</itemizedlist>
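<para>The effect of register allocation on operand tags can be
illustrated with a deliberately naive Python sketch. It is not
Valgrind's allocator: it hands out registers first come, first
served, purely to show
<computeroutput>TempReg</computeroutput>s turning into
<computeroutput>RealReg</computeroutput>s and
<computeroutput>SpillNo</computeroutput>s. The numeric values of the
<computeroutput>Tag</computeroutput> enumeration and the function
name are assumptions. The register numbers are the standard Intel
encodings of the five allocatable host registers.</para>

```python
from enum import Enum


class Tag(Enum):
    # Member values are arbitrary; only the names come from the text.
    NoValue = 0
    Lit16 = 1
    Literal = 2
    SpillNo = 3
    RealReg = 4
    ArchReg = 5
    TempReg = 6


# Intel encodings of the five allocatable host registers.
ALLOCATABLE = {"%eax": 0, "%ecx": 1, "%edx": 2, "%ebx": 3, "%esi": 6}
NUM_SPILL_SLOTS = 24


def toy_allocate(operands):
    """Rewrite (tag, value) operands so that no TempReg remains.

    First come, first served; a stand-in for the real allocator.
    """
    free_regs = sorted(ALLOCATABLE.values())
    mapping = {}
    out = []
    for tag, val in operands:
        if tag is not Tag.TempReg:
            out.append((tag, val))
            continue
        if val not in mapping:
            if free_regs:
                mapping[val] = (Tag.RealReg, free_regs.pop(0))
            else:
                slot = sum(1 for t, _ in mapping.values() if t is Tag.SpillNo)
                if slot >= NUM_SPILL_SLOTS:
                    raise RuntimeError("out of spill slots")
                mapping[val] = (Tag.SpillNo, slot)
        out.append(mapping[val])
    return out


operands = [(Tag.TempReg, n) for n in range(7)] + [(Tag.ArchReg, 0)]
for tag, val in toy_allocate(operands):
    print(tag.name, val)
```

<para>Seven distinct <computeroutput>TempReg</computeroutput>s are
more than five registers can hold, so in this sketch the last two end
up in spill slots 0 and 1, while the
<computeroutput>ArchReg</computeroutput> operand passes through
unchanged.</para>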
				1027
				1028	</sect2>
				1029
				1030
				1031
				1032	<sect2 id="mc-tech-docs.uinstr"
				1033	xreflabel="UCode instructions: type 'UInstr'">
				1034	<title>UCode instructions: type <computeroutput>UInstr</computeroutput></title>
				1035
				1036	<para>UCode was carefully designed to make it possible to do
				1037	register allocation on UCode and then translate the result into
				1038	x86 code without needing any extra registers ... well, that was
				1039	the original plan, anyway. Things have gotten a little more
				1040	complicated since then. In what follows, UCode instructions are
				1041	referred to as uinstrs, to distinguish them from x86
				1042	instructions. Uinstrs of course have uopcodes which are
				1043	(naturally) different from x86 opcodes.</para>
				1044
				1045	<para>A uinstr (type <computeroutput>UInstr</computeroutput>)
				1046	contains various fields, not all of which are used by any one
				1047	uopcode:</para>
				1048
				1049	<itemizedlist>
				1050
				1051	<listitem>
				1052	<para>Three 16-bit operand fields,
				1053	<computeroutput>val1</computeroutput>,
				1054	<computeroutput>val2</computeroutput> and
				1055	<computeroutput>val3</computeroutput>.</para>
				1056	</listitem>
				1057
				1058	<listitem>
				1059	<para>Three tag fields,
				1060	<computeroutput>tag1</computeroutput>,
				1061	<computeroutput>tag2</computeroutput> and
				1062	<computeroutput>tag3</computeroutput>. Each of these has a
				1063	value of type <computeroutput>Tag</computeroutput>, and they
				1064	describe what the <computeroutput>val1</computeroutput>,
				1065	<computeroutput>val2</computeroutput> and
				1066	<computeroutput>val3</computeroutput> fields contain.</para>
				1067	</listitem>
				1068
				1069	<listitem>
				1070	<para>A 32-bit literal field.</para>
				1071	</listitem>
				1072
				1073	<listitem>
				1074	<para>Two <computeroutput>FlagSet</computeroutput>s,
				1075	specifying which x86 condition codes are read and written by
				1076	the uinstr.</para>
				1077	</listitem>
				1078
				1079	<listitem>
				1080	<para>An opcode byte, containing a value of type
				1081	<computeroutput>Opcode</computeroutput>.</para>
				1082	</listitem>
				1083
				1084	<listitem>
				1085	<para>A size field, indicating the data transfer size
				1086	(1/2/4/8/10) in cases where this makes sense, or zero
				1087	otherwise.</para>
				1088	</listitem>
				1089
				1090	<listitem>
				1091	<para>A condition-code field, which, for jumps, holds a value
				1092	of type <computeroutput>Condcode</computeroutput>, indicating
				1093	the condition which applies. The encoding is as it is in the
				1094	x86 insn stream, except we add a 17th value
				1095	<computeroutput>CondAlways</computeroutput> to indicate an
				1096	unconditional transfer.</para>
				1097	</listitem>
				1098
				1099	<listitem>
				1100	<para>Various 1-bit flags, indicating whether this insn
				1101	pertains to an x86 CALL or RET instruction, whether a
				1102	widening is signed or not, etc.</para>
				1103	</listitem>
				1104
				1105	</itemizedlist>
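<para>As a rough picture of the record just described, here is a
Python sketch of a uinstr with a few consistency checks. The field
names <computeroutput>val1</computeroutput> to
<computeroutput>val3</computeroutput>,
<computeroutput>tag1</computeroutput> to
<computeroutput>tag3</computeroutput> and
<computeroutput>lit32</computeroutput> follow the text. All other
field names are guesses, and the checks are only those implied by
this section, not the ones
<computeroutput>VG_(saneUInstr)</computeroutput> actually
performs.</para>

```python
from dataclasses import dataclass


@dataclass
class UInstr:
    opcode: str
    size: int = 0                # 1/2/4/8/10, or 0 where meaningless
    val1: int = 0                # three 16-bit operand fields
    val2: int = 0
    val3: int = 0
    tag1: str = "NoValue"        # what val1..val3 contain
    tag2: str = "NoValue"
    tag3: str = "NoValue"
    lit32: int = 0               # the single 32-bit literal
    flags_read: frozenset = frozenset()     # hypothetical name
    flags_written: frozenset = frozenset()  # hypothetical name
    cond: str = "CondAlways"     # only meaningful for jumps
    signed_widen: bool = False   # one of the 1-bit flags


def looks_sane(u):
    """Check the constraints stated in the text, nothing more."""
    vals_ok = all(v in range(0, 65536) for v in (u.val1, u.val2, u.val3))
    # Only one lit32 exists, so at most one operand may be tagged Literal.
    one_literal = [u.tag1, u.tag2, u.tag3].count("Literal") in (0, 1)
    size_ok = u.size in (0, 1, 2, 4, 8, 10)
    return vals_ok and one_literal and size_ok


add = UInstr("ADD", size=4, tag1="Literal", lit32=5, val2=3, tag2="TempReg")
print(looks_sane(add))
```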
				1106
				1107	<para>UOpcodes (type <computeroutput>Opcode</computeroutput>) are
				1108	divided into two groups: those necessary merely to express the
				1109	functionality of the x86 code, and extra uopcodes needed to
				1110	express the instrumentation. The former group contains:</para>
				1111
				1112	<itemizedlist>
				1113
				1114	<listitem>
				1115	<para><computeroutput>GET</computeroutput> and
				1116	<computeroutput>PUT</computeroutput>, which move values from
				1117	the simulated CPU's integer registers
				1118	(<computeroutput>ArchReg</computeroutput>s) into
				1119	<computeroutput>TempReg</computeroutput>s, and back.
				1120	<computeroutput>GETF</computeroutput> and
				1121	<computeroutput>PUTF</computeroutput> do the corresponding
				1122	thing for the simulated
				1123	<computeroutput>%EFLAGS</computeroutput>. There are no
				1124	corresponding insns for the FPU register stack, since we
				1125	don't explicitly simulate its registers.</para>
				1126	</listitem>
				1127
				1128	<listitem>
				1129	<para><computeroutput>LOAD</computeroutput> and
				1130	<computeroutput>STORE</computeroutput>, which, in RISC-like
				1131	fashion, are the only uinstrs able to interact with
				1132	memory.</para>
				1133	</listitem>
				1134
				1135	<listitem>
				1136	<para><computeroutput>MOV</computeroutput> and
				1137	<computeroutput>CMOV</computeroutput> allow unconditional and
				1138	conditional moves of values between
				1139	<computeroutput>TempReg</computeroutput>s.</para>
				1140	</listitem>
				1141
				1142	<listitem>
				1143	<para>ALU operations. Again in RISC-like fashion, these only
				1144	operate on <computeroutput>TempReg</computeroutput>s (before
				1145	reg-alloc) or <computeroutput>RealReg</computeroutput>s
				1146	(after reg-alloc). These are:
				1147	<computeroutput>ADD</computeroutput>,
				1148	<computeroutput>ADC</computeroutput>,
				1149	<computeroutput>AND</computeroutput>,
				1150	<computeroutput>OR</computeroutput>,
				1151	<computeroutput>XOR</computeroutput>,
				1152	<computeroutput>SUB</computeroutput>,
				1153	<computeroutput>SBB</computeroutput>,
				1154	<computeroutput>SHL</computeroutput>,
				1155	<computeroutput>SHR</computeroutput>,
				1156	<computeroutput>SAR</computeroutput>,
				1157	<computeroutput>ROL</computeroutput>,
				1158	<computeroutput>ROR</computeroutput>,
				1159	<computeroutput>RCL</computeroutput>,
				1160	<computeroutput>RCR</computeroutput>,
				1161	<computeroutput>NOT</computeroutput>,
				1162	<computeroutput>NEG</computeroutput>,
				1163	<computeroutput>INC</computeroutput>,
				1164	<computeroutput>DEC</computeroutput>,
				1165	<computeroutput>BSWAP</computeroutput>,
				1166	<computeroutput>CC2VAL</computeroutput> and
				1167	<computeroutput>WIDEN</computeroutput>.
				1168	<computeroutput>WIDEN</computeroutput> does signed or
				1169	unsigned value widening.
				1170	<computeroutput>CC2VAL</computeroutput> is used to convert
				1171	condition codes into a value, zero or one. The rest are
				1172	obvious.</para>
				1173
				1174	<para>To allow for more efficient code generation, we bend
				1175	slightly the restriction at the start of the previous para:
				1176	for <computeroutput>ADD</computeroutput>,
				1177	<computeroutput>ADC</computeroutput>,
				1178	<computeroutput>XOR</computeroutput>,
				1179	<computeroutput>SUB</computeroutput> and
				1180	<computeroutput>SBB</computeroutput>, we allow the first
				1181	(source) operand to also be an
				1182	<computeroutput>ArchReg</computeroutput>, that is, one of the
				1183	simulated machine's registers. Also, many of these ALU ops
				1184	allow the source operand to be a literal. See
				1185	<computeroutput>VG_(saneUInstr)</computeroutput> for the
				1186	final word on the allowable forms of uinstrs.</para>
				1187	</listitem>
				1188
				1189	<listitem>
				1190	<para><computeroutput>LEA1</computeroutput> and
				1191	<computeroutput>LEA2</computeroutput> are not strictly
				1192	necessary, but facilitate better translations. They
				1193	record the fancy x86 addressing modes in a direct way, which
				1194	allows those amodes to be emitted back into the final
				1195	instruction stream more or less verbatim.</para>
				1196	</listitem>
				1197
				1198	<listitem>
				1199	<para><computeroutput>CALLM</computeroutput> calls a
				1200	machine-code helper, one of the methods whose address is
				1201	stored at some
				1202	<computeroutput>VG_(baseBlock)</computeroutput> offset.
				1203	<computeroutput>PUSH</computeroutput> and
				1204	<computeroutput>POP</computeroutput> move values between
				1205	<computeroutput>TempReg</computeroutput>s and the real
				1206	(Valgrind's) stack, and
				1207	<computeroutput>CLEAR</computeroutput> removes values from
				1208	the stack. <computeroutput>CALLM_S</computeroutput> and
				1209	<computeroutput>CALLM_E</computeroutput> delimit the
				1210	boundaries of call setups and clearings, for the benefit of
				1211	the instrumentation passes. Getting this right is critical,
				1212	and so <computeroutput>VG_(saneUCodeBlock)</computeroutput>
				1213	makes various checks on the use of these uopcodes.</para>
				1214
				1215	<para>It is important to understand that these uopcodes have
				1216	nothing to do with the x86
				1217	<computeroutput>call</computeroutput>,
				1218	<computeroutput>ret</computeroutput>,
				1219	<computeroutput>push</computeroutput> or
				1220	<computeroutput>pop</computeroutput> instructions, and are
				1221	not used to implement them. Those guys turn into
				1222	combinations of <computeroutput>GET</computeroutput>,
				1223	<computeroutput>PUT</computeroutput>,
				1224	<computeroutput>LOAD</computeroutput>,
				1225	<computeroutput>STORE</computeroutput>,
				1226	<computeroutput>ADD</computeroutput>,
				1227	<computeroutput>SUB</computeroutput>, and
				1228	<computeroutput>JMP</computeroutput>. What these uopcodes
				1229	support is calling of helper functions such as
				1230	<computeroutput>VG_(helper_imul_32_64)</computeroutput>,
				1231	which do stuff which is too difficult or tedious to emit
				1232	inline.</para>
				1233	</listitem>
				1234
				1235	<listitem>
				1236	<para><computeroutput>FPU</computeroutput>,
				1237	<computeroutput>FPU_R</computeroutput> and
				1238	<computeroutput>FPU_W</computeroutput>. Valgrind doesn't
				1239	attempt to simulate the internal state of the FPU at all.
				1240	Consequently it only needs to be able to distinguish FPU ops
				1241	which read and write memory from those that don't, and for
				1242	those which do, it needs to know the effective address and
				1243	data transfer size. This is made easier because the x86 FP
				1244	instruction encoding is very regular, basically consisting of
				1245	16 bits for a non-memory FPU insn and 11 (IIRC) bits + an
				1246	address mode for a memory FPU insn. So our
				1247	<computeroutput>FPU</computeroutput> uinstr carries the 16
				1248	bits in its <computeroutput>val1</computeroutput> field. And
				1249	<computeroutput>FPU_R</computeroutput> and
				1250	<computeroutput>FPU_W</computeroutput> carry 11 bits in that
				1251	field, together with the identity of a
				1252	<computeroutput>TempReg</computeroutput> or (later)
				1253	<computeroutput>RealReg</computeroutput> which contains the
				1254	address.</para>
				1255	</listitem>
				1256
				1257	<listitem>
				1258	<para><computeroutput>JIFZ</computeroutput> is unique, in
				1259	that it allows a control-flow transfer which is not deemed to
				1260	end a basic block. It causes a jump to a literal (original)
				1261	address if the specified argument is zero.</para>
				1262	</listitem>
				1263
				1264	<listitem>
				1265	<para>Finally, <computeroutput>INCEIP</computeroutput>
				1266	advances the simulated <computeroutput>%EIP</computeroutput>
				1267	by the specified literal amount. This supports lazy
				1268	<computeroutput>%EIP</computeroutput> updating, as described
				1269	below.</para>
				1270	</listitem>
				1271
				1272	</itemizedlist>
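<para>To make the point about x86
<computeroutput>push</computeroutput> concrete, the following Python
sketch interprets a tiny subset of the uopcodes above and runs one
plausible expansion of <computeroutput>pushl %eax</computeroutput>.
The expansion is illustrative and has not been taken from
<computeroutput>VG_(disBB)</computeroutput>. The tuple encoding of
uinstrs, the dictionary used as memory and the function name
<computeroutput>run</computeroutput> are all assumptions. Operands
are written source first, destination second.</para>

```python
def run(ucode, arch, memory):
    """Interpret a tiny UCode subset; each uinstr is (op, source, dest)."""
    temps = {}
    for op, src, dst in ucode:
        if op == "GET":        # ArchReg -> TempReg
            temps[dst] = arch[src]
        elif op == "PUT":      # TempReg -> ArchReg
            arch[dst] = temps[src]
        elif op == "ADD":      # literal source, TempReg destination
            temps[dst] = (temps[dst] + src) % 2**32
        elif op == "SUB":      # literal source, TempReg destination
            temps[dst] = (temps[dst] - src) % 2**32
        elif op == "STORE":    # TempReg value -> memory at address in TempReg
            memory[temps[dst]] = temps[src]
        elif op == "LOAD":     # memory at address in TempReg -> TempReg
            temps[dst] = memory[temps[src]]
        else:
            raise ValueError(f"unsupported uopcode {op}")


# One plausible expansion of "pushl %eax" (an assumption, see above).
PUSH_EAX = [
    ("GET", "EAX", "t0"),    # value to push
    ("GET", "ESP", "t1"),    # current simulated stack pointer
    ("SUB", 4, "t1"),        # make room for one word
    ("PUT", "t1", "ESP"),    # write back the new %ESP
    ("STORE", "t0", "t1"),   # store the value at the new top of stack
]

arch = {"EAX": 0x1234, "ESP": 0x8000}
memory = {}
run(PUSH_EAX, arch, memory)
print(hex(arch["ESP"]), hex(memory[0x7FFC]))
```

<para>Note that no <computeroutput>PUSH</computeroutput> uopcode
appears: the simulated stack is ordinary client memory, reached with
<computeroutput>STORE</computeroutput>.</para>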
				1273
				1274	<para>Stages 1 and 2 of the 6-stage translation process mentioned
				1275	above deal purely with these uopcodes, and no others. They are
				1276	sufficient to express pretty much all the x86 32-bit
				1277	protected-mode instruction set, at least everything understood by
				1278	a pre-MMX original Pentium (P54C).</para>
				1279
				1280	<para>Stages 3, 4, 5 and 6 also deal with the following extra
				1281	"instrumentation" uopcodes. They are used to express all the
				1282	definedness-tracking and -checking machinery which Valgrind implements.
				1283	In later sections we show how to create checking code for each of
				1284	the uopcodes above. Note that these instrumentation uopcodes,
				1285	although some appear complicated, have been carefully chosen
				1286	so that efficient x86 code can be generated for them. GNU
				1287	superopt v2.5 did a great job helping out here. Anyways, the
				1288	uopcodes are as follows:</para>
				1289
				1290	<itemizedlist>
				1291
				1292	<listitem>
				1293	<para><computeroutput>GETV</computeroutput> and
				1294	<computeroutput>PUTV</computeroutput> are analogues to
				1295	<computeroutput>GET</computeroutput> and
				1296	<computeroutput>PUT</computeroutput> above. They are
				1297	identical except that they move the V bits for the specified
				1298	values back and forth to
				1299	<computeroutput>TempReg</computeroutput>s, rather than moving
				1300	the values themselves.</para>
				1301	</listitem>
				1302
				1303	<listitem>
				1304	<para>Similarly, <computeroutput>LOADV</computeroutput> and
				1305	<computeroutput>STOREV</computeroutput> read and write V bits
				1306	from the synthesised shadow memory that Valgrind maintains.
				1307	In fact they do more than that, since they also do
				1308	address-validity checks, and emit complaints if the
				1309	read/written addresses are unaddressable.</para>
				1310	</listitem>
				1311
				1312	<listitem>
				1313	<para><computeroutput>TESTV</computeroutput>, whose
				1314	parameters are a <computeroutput>TempReg</computeroutput> and
				1315	a size, tests the V bits in the
				1316	<computeroutput>TempReg</computeroutput>, at the specified
				1317	operation size (0/1/2/4 byte) and emits an error if any of
				1318	them indicate undefinedness. This is the only uopcode
				1319	capable of doing such tests.</para>
				1320	</listitem>
				1321
				1322	<listitem>
<para><computeroutput>SETV</computeroutput>, whose parameters
are also a <computeroutput>TempReg</computeroutput> and a size,
makes the V bits in the
<computeroutput>TempReg</computeroutput> indicate
definedness, at the specified operation size. This is
usually used to generate the correct V bits for a literal
value, which is of course fully defined.</para>
				1330	</listitem>
				1331
				1332	<listitem>
				1333	<para><computeroutput>GETVF</computeroutput> and
				1334	<computeroutput>PUTVF</computeroutput> are analogues to
				1335	<computeroutput>GETF</computeroutput> and
				1336	<computeroutput>PUTF</computeroutput>. They move the single
				1337	V bit used to model definedness of
				1338	<computeroutput>%EFLAGS</computeroutput> between its home in
				1339	<computeroutput>VG_(baseBlock)</computeroutput> and the
				1340	specified <computeroutput>TempReg</computeroutput>.</para>
				1341	</listitem>
				1342
				1343	<listitem>
				1344	<para><computeroutput>TAG1</computeroutput> denotes one of a
				1345	family of unary operations on
				1346	<computeroutput>TempReg</computeroutput>s containing V bits.
				1347	Similarly, <computeroutput>TAG2</computeroutput> denotes one
				1348	in a family of binary operations on V bits.</para>
				1349	</listitem>
				1350
				1351	</itemizedlist>
				1352
				1353
				1354	<para>These 10 uopcodes are sufficient to express Valgrind's
				1355	entire definedness-checking semantics. In fact most of the
				1356	interesting magic is done by the
				1357	<computeroutput>TAG1</computeroutput> and
				1358	<computeroutput>TAG2</computeroutput> suboperations.</para>
				1359
<para>First, however, I need to explain about V-vector operation
sizes. There are 4 sizes. Three of them, 1, 2 and 4, operate on
groups of 8, 16 and 32 V bits at a time, supporting the usual 1, 2
and 4 byte x86 operations. The fourth is the mysterious size
0, which really means a single V bit. Single V bits are used in
				1365	various circumstances; in particular, the definedness of
				1366	<computeroutput>%EFLAGS</computeroutput> is modelled with a
				1367	single V bit. Now might be a good time to also point out that
				1368	for V bits, 1 means "undefined" and 0 means "defined".
				1369	Similarly, for A bits, 1 means "invalid address" and 0 means
				1370	"valid address". This seems counterintuitive (and so it is), but
				1371	testing against zero on x86s saves instructions compared to
				1372	testing against all 1s, because many ALU operations set the Z
				1373	flag for free, so to speak.</para>
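<para>A minimal sketch of this encoding follows. It is not taken
from the Valgrind sources: the names
<computeroutput>VBits</computeroutput>,
<computeroutput>vmask</computeroutput> and
<computeroutput>fully_defined</computeroutput> are invented for
illustration, and a V-bit vector is simply assumed to live in the
low bits of a 32-bit word. It shows why the "1 means undefined"
convention makes the common test a comparison against zero.</para>
<programlisting><![CDATA[
```c
#include <stdint.h>

/* Hypothetical names, for illustration only.
   In a V-bit vector, 1 = undefined and 0 = defined. */
typedef uint32_t VBits;

/* Mask selecting the V bits that matter for an operation size.
   Size 0 means a single V bit; sizes 1, 2 and 4 mean 8, 16 and 32 bits. */
static VBits vmask(int size)
{
    switch (size) {
    case 0:  return 0x1u;
    case 1:  return 0xFFu;
    case 2:  return 0xFFFFu;
    default: return 0xFFFFFFFFu;
    }
}

/* A value is fully defined at this size iff every relevant V bit is 0,
   so the check is a single test against zero. */
static int fully_defined(VBits v, int size)
{
    return (v & vmask(size)) == 0;
}
```
]]></programlisting>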
				1374
				1375	<para>With that in mind, the tag ops are:</para>
				1376
				1377	<itemizedlist>
				1378
				1379	<listitem>
				1380	<formalpara>
				1381	<title>(UNARY) Pessimising casts:</title>
				1382	<para><computeroutput>VgT_PCast40</computeroutput>,
				1383	<computeroutput>VgT_PCast20</computeroutput>,
				1384	<computeroutput>VgT_PCast10</computeroutput>,
				1385	<computeroutput>VgT_PCast01</computeroutput>,
				1386	<computeroutput>VgT_PCast02</computeroutput> and
				1387	<computeroutput>VgT_PCast04</computeroutput>. A "pessimising
				1388	cast" takes a V-bit vector at one size, and creates a new one
				1389	at another size, pessimised in the sense that if any of the
				1390	bits in the source vector indicate undefinedness, then all
				1391	the bits in the result indicate undefinedness. In this case
				1392	the casts are all to or from a single V bit, so for example
				1393	<computeroutput>VgT_PCast40</computeroutput> is a pessimising
				1394	cast from 32 bits to 1, whereas
				1395	<computeroutput>VgT_PCast04</computeroutput> simply copies
				1396	the single source V bit into all 32 bit positions in the
				1397	result. Surprisingly, these ops can all be implemented very
				1398	efficiently.</para>
				1399	</formalpara>
				1400
				1401	<para>There are also the pessimising casts
				1402	<computeroutput>VgT_PCast14</computeroutput>, from 8 bits to
				1403	32, <computeroutput>VgT_PCast12</computeroutput>, from 8 bits
				1404	to 16, and <computeroutput>VgT_PCast11</computeroutput>, from
				1405	8 bits to 8. This last one seems nonsensical, but in fact it
				1406	isn't a no-op because, as mentioned above, any undefined (1)
				1407	bits in the source infect the entire result.</para>
				1408	</listitem>
				1409
				1410	<listitem>
				1411	<formalpara>
				1412	<title>(UNARY) Propagating undefinedness upwards in a
				1413	word:</title>
				1414	<para><computeroutput>VgT_Left4</computeroutput>,
				1415	<computeroutput>VgT_Left2</computeroutput> and
				1416	<computeroutput>VgT_Left1</computeroutput>. These are used
				1417	to simulate the worst-case effects of carry propagation in
adds and subtracts. They return a V vector identical to the
original, except that if the original contained any undefined
bits, then the lowest undefined bit and all bits above it are
marked as undefined too. Hence the Left bit in the
names.</para></formalpara>
				1422	</listitem>
				1423
				1424	<listitem>
				1425	<formalpara>
				1426	<title>(UNARY) Signed and unsigned value widening:</title>
				1427	<para><computeroutput>VgT_SWiden14</computeroutput>,
				1428	<computeroutput>VgT_SWiden24</computeroutput>,
				1429	<computeroutput>VgT_SWiden12</computeroutput>,
				1430	<computeroutput>VgT_ZWiden14</computeroutput>,
				1431	<computeroutput>VgT_ZWiden24</computeroutput> and
				1432	<computeroutput>VgT_ZWiden12</computeroutput>. These mimic
				1433	the definedness effects of standard signed and unsigned
integer widening. Unsigned widening creates zero bits in the
new positions, so
<computeroutput>VgT_ZWiden*</computeroutput> accordingly
marks those new positions as defined. Signed
				1438	widening copies the sign bit into the new positions, so
				1439	<computeroutput>VgT_SWiden*</computeroutput> copies the
				1440	definedness of the sign bit into the new positions. Because
				1441	1 means undefined and 0 means defined, these operations can
				1442	(fascinatingly) be done by the same operations which they
				1443	mimic. Go figure.</para>
				1444	</formalpara>
				1445	</listitem>
				1446
				1447	<listitem>
				1448	<formalpara>
				1449	<title>(BINARY) Undefined-if-either-Undefined,
				1450	Defined-if-either-Defined:</title>
				1451	<para><computeroutput>VgT_UifU4</computeroutput>,
				1452	<computeroutput>VgT_UifU2</computeroutput>,
				1453	<computeroutput>VgT_UifU1</computeroutput>,
				1454	<computeroutput>VgT_UifU0</computeroutput>,
				1455	<computeroutput>VgT_DifD4</computeroutput>,
				1456	<computeroutput>VgT_DifD2</computeroutput>,
				1457	<computeroutput>VgT_DifD1</computeroutput>. These do simple
				1458	bitwise operations on pairs of V-bit vectors, with
				1459	<computeroutput>UifU</computeroutput> giving undefined if
				1460	either arg bit is undefined, and
				1461	<computeroutput>DifD</computeroutput> giving defined if
				1462	either arg bit is defined. Abstract interpretation junkies,
				1463	if any make it this far, may like to think of them as meets
				1464	and joins (or is it joins and meets) in the definedness
				1465	lattices.</para>
				1466	</formalpara>
				1467	</listitem>
				1468
				1469	<listitem>
				1470	<formalpara>
				1471	<title>(BINARY; one value, one V bits) Generate argument
				1472	improvement terms for AND and OR</title>
				1473	<para><computeroutput>VgT_ImproveAND4_TQ</computeroutput>,
				1474	<computeroutput>VgT_ImproveAND2_TQ</computeroutput>,
				1475	<computeroutput>VgT_ImproveAND1_TQ</computeroutput>,
				1476	<computeroutput>VgT_ImproveOR4_TQ</computeroutput>,
				1477	<computeroutput>VgT_ImproveOR2_TQ</computeroutput>,
				1478	<computeroutput>VgT_ImproveOR1_TQ</computeroutput>. These
				1479	help out with AND and OR operations. AND and OR have the
				1480	inconvenient property that the definedness of the result
				1481	depends on the actual values of the arguments as well as
				1482	their definedness. At the bit level:</para></formalpara>
				1483	<programlisting><![CDATA[
				1484	1 AND undefined = undefined, but
				1485	0 AND undefined = 0, and
				1486	similarly
				1487	0 OR undefined = undefined, but
				1488	1 OR undefined = 1.]]></programlisting>
				1489
				1490	<para>It turns out that gcc (quite legitimately) generates
				1491	code which relies on this fact, so we have to model it
				1492	properly in order to avoid flooding users with spurious value
				1493	errors. The ultimate definedness result of AND and OR is
				1494	calculated using <computeroutput>UifU</computeroutput> on the
				1495	definedness of the arguments, but we also
				1496	<computeroutput>DifD</computeroutput> in some "improvement"
				1497	terms which take into account the above phenomena.</para>
				1498
<para><computeroutput>ImproveAND</computeroutput> takes as
its arguments the actual value of an argument to AND
(the T) and the definedness of that argument (the Q), and
returns a V-bit vector which is defined (0) for bits which
have value 0 and are defined. This vector, when
<computeroutput>DifD</computeroutput>ed into the final result,
causes those bits to be defined even if the corresponding bit
in the other argument is undefined.</para>
				1507
				1508	<para>The <computeroutput>ImproveOR</computeroutput> ops do
				1509	the dual thing for OR arguments. Note that XOR does not have
				1510	this property that one argument can make the other
				1511	irrelevant, so there is no need for such complexity for
				1512	XOR.</para>
				1513	</listitem>
				1514
				1515	</itemizedlist>
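<para>The unary tag ops described in the list above can be
sketched in plain C as follows. This is an illustration under
stated assumptions, not Valgrind's generated code: the function
names are invented, V-bit vectors are held in the low bits of a
<computeroutput>uint32_t</computeroutput>, and computing
<computeroutput>Left</computeroutput> as "value OR its own
negation" is just one convenient way to get the described
effect.</para>
<programlisting><![CDATA[
```c
#include <stdint.h>

/* Illustrative names only.  1 = undefined, 0 = defined. */

/* PCast40: 32 V bits -> 1 V bit; undefined if any source bit is. */
static uint32_t pcast40(uint32_t v) { return v != 0 ? 1u : 0u; }

/* PCast04: 1 V bit -> 32 V bits; copy the single bit everywhere. */
static uint32_t pcast04(uint32_t v) { return (v & 1u) ? 0xFFFFFFFFu : 0u; }

/* PCast11: 8 V bits -> 8 V bits; any undefined bit infects all eight. */
static uint32_t pcast11(uint32_t v) { return (v & 0xFFu) ? 0xFFu : 0u; }

/* Left4: the lowest undefined bit and every bit above it become
   undefined.  In unsigned arithmetic, v | -v has exactly that shape. */
static uint32_t left4(uint32_t v) { return v | (0u - v); }

/* ZWiden14: the 24 new upper positions hold zeroes, hence are defined. */
static uint32_t zwiden14(uint32_t v) { return v & 0xFFu; }

/* SWiden14: the definedness of bit 7 is copied into bits 8..31, which
   is an ordinary sign extension of the 8 V bits. */
static uint32_t swiden14(uint32_t v)
{
    v &= 0xFFu;
    return (v & 0x80u) ? (v | 0xFFFFFF00u) : v;
}
```
]]></programlisting>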
				1516
				1517	<para>That's all the tag ops. If you stare at this long enough,
				1518	and then run Valgrind and stare at the pre- and post-instrumented
				1519	ucode, it should be fairly obvious how the instrumentation
				1520	machinery hangs together.</para>
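<para>To make the binary tag ops concrete, here is a small C
sketch of how the definedness of an AND or OR result could be
assembled from <computeroutput>UifU</computeroutput>,
<computeroutput>DifD</computeroutput> and the improvement terms.
The function names are invented, and the bitwise formulas
(<computeroutput>T | Q</computeroutput> for
<computeroutput>ImproveAND</computeroutput>,
<computeroutput>~T | Q</computeroutput> for
<computeroutput>ImproveOR</computeroutput>) are my reading of the
descriptions above rather than a quotation of the sources.</para>
<programlisting><![CDATA[
```c
#include <stdint.h>

/* Illustrative names only.  1 = undefined, 0 = defined. */

/* UifU: undefined if either argument bit is undefined. */
static uint32_t uifu(uint32_t q1, uint32_t q2) { return q1 | q2; }

/* DifD: defined if either argument bit is defined. */
static uint32_t difd(uint32_t q1, uint32_t q2) { return q1 & q2; }

/* ImproveAND: result bit is defined (0) only where the value bit is 0
   and that value bit is itself defined. */
static uint32_t improve_and(uint32_t t, uint32_t q) { return t | q; }

/* ImproveOR: the dual; defined (0) only where the value bit is 1 and
   that value bit is itself defined. */
static uint32_t improve_or(uint32_t t, uint32_t q) { return ~t | q; }

/* V bits of (t1 AND t2), given each argument's value and V bits. */
static uint32_t and_vbits(uint32_t t1, uint32_t q1, uint32_t t2, uint32_t q2)
{
    uint32_t v = uifu(q1, q2);
    v = difd(improve_and(t1, q1), v);
    v = difd(improve_and(t2, q2), v);
    return v;
}

/* V bits of (t1 OR t2), built the same way from ImproveOR. */
static uint32_t or_vbits(uint32_t t1, uint32_t q1, uint32_t t2, uint32_t q2)
{
    uint32_t v = uifu(q1, q2);
    v = difd(improve_or(t1, q1), v);
    v = difd(improve_or(t2, q2), v);
    return v;
}
```
]]></programlisting>
<para>For instance, ANDing the fully defined value
<computeroutput>0x0F</computeroutput> with a completely undefined
byte gives V bits <computeroutput>0x0F</computeroutput>: the top
four result bits are known to be zero, and only the bottom four
remain undefined.</para>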
				1521
				1522	<para>One point, if you do this: in order to make it easy to
				1523	differentiate <computeroutput>TempReg</computeroutput>s carrying
				1524	values from <computeroutput>TempReg</computeroutput>s carrying V
				1525	bit vectors, Valgrind prints the former as (for example)
				1526	<computeroutput>t28</computeroutput> and the latter as
				1527	<computeroutput>q28</computeroutput>; the fact that they carry
				1528	the same number serves to indicate their relationship. This is
				1529	purely for the convenience of the human reader; the register
				1530	allocator and code generator don't regard them as
				1531	different.</para>
				1532
				1533	</sect2>
				1534
				1535
				1536
<sect2 id="mc-tech-docs.trans" xreflabel="Translation into UCode">
<title>Translation into UCode</title>
				1539
				1540	<para><computeroutput>VG_(disBB)</computeroutput> allocates a new
				1541	<computeroutput>UCodeBlock</computeroutput> and then uses
				1542	<computeroutput>disInstr</computeroutput> to translate x86
				1543	instructions one at a time into UCode, dumping the result in the
				1544	<computeroutput>UCodeBlock</computeroutput>. This goes on until
				1545	a control-flow transfer instruction is encountered.</para>
				1546
<para>Despite the large size of
<filename>vg_to_ucode.c</filename>, this translation is really
very simple. Each x86 instruction is translated entirely
independently of its neighbours, merrily allocating new
<computeroutput>TempReg</computeroutput>s as it goes. The idea
is to have a simple translator, in reality no more than a
macro-expander, and to let the UCode optimisation phase which
follows clean up the resulting bad UCode translation. To
give you an idea of some x86 instructions and their translations
(this is a complete basic block, as Valgrind sees it):</para>
				1557	<programlisting><![CDATA[
				1558	0x40435A50: incl %edx
				1559	0: GETL %EDX, t0
				1560	1: INCL t0 (-wOSZAP)
				1561	2: PUTL t0, %EDX
				1562
				1563	0x40435A51: movsbl (%edx),%eax
				1564	3: GETL %EDX, t2
				1565	4: LDB (t2), t2
				1566	5: WIDENL_Bs t2
				1567	6: PUTL t2, %EAX
				1568
				1569	0x40435A54: testb $0x20, 1(%ecx,%eax,2)
				1570	7: GETL %EAX, t6
				1571	8: GETL %ECX, t8
				1572	9: LEA2L 1(t8,t6,2), t4
				1573	10: LDB (t4), t10
				1574	11: MOVB $0x20, t12
				1575	12: ANDB t12, t10 (-wOSZACP)
				1576	13: INCEIPo $9
				1577
				1578	0x40435A59: jnz-8 0x40435A50
				1579	14: Jnzo $0x40435A50 (-rOSZACP)
				1580	15: JMPo $0x40435A5B]]></programlisting>
				1581
				1582	<para>Notice how the block always ends with an unconditional jump
				1583	to the next block. This is a bit unnecessary, but makes many
				1584	things simpler.</para>
				1585
				1586	<para>Most x86 instructions turn into sequences of
				1587	<computeroutput>GET</computeroutput>,
				1588	<computeroutput>PUT</computeroutput>,
				1589	<computeroutput>LEA1</computeroutput>,
				1590	<computeroutput>LEA2</computeroutput>,
				1591	<computeroutput>LOAD</computeroutput> and
				1592	<computeroutput>STORE</computeroutput>. Some complicated ones
				1593	however rely on calling helper bits of code in
				1594	<filename>vg_helpers.S</filename>. The ucode instructions
				1595	<computeroutput>PUSH</computeroutput>,
				1596	<computeroutput>POP</computeroutput>,
				1597	<computeroutput>CALL</computeroutput>,
				1598	<computeroutput>CALLM_S</computeroutput> and
				1599	<computeroutput>CALLM_E</computeroutput> support this. The
				1600	calling convention is somewhat ad-hoc and is not the C calling
				1601	convention. The helper routines must save all integer registers,
				1602	and the flags, that they use. Args are passed on the stack
				1603	underneath the return address, as usual, and if result(s) are to
				1604	be returned, it (they) are either placed in dummy arg slots
				1605	created by the ucode <computeroutput>PUSH</computeroutput>
				1606	sequence, or just overwrite the incoming args.</para>
				1607
				1608	<para>In order that the instrumentation mechanism can handle
				1609	calls to these helpers,
				1610	<computeroutput>VG_(saneUCodeBlock)</computeroutput> enforces the
				1611	following restrictions on calls to helpers:</para>
				1612
				1613	<itemizedlist>
				1614
				1615	<listitem>
				1616	<para>Each <computeroutput>CALL</computeroutput> uinstr must
				1617	be bracketed by a preceding
				1618	<computeroutput>CALLM_S</computeroutput> marker (dummy
				1619	uinstr) and a trailing
				1620	<computeroutput>CALLM_E</computeroutput> marker. These
				1621	markers are used by the instrumentation mechanism later to
				1622	establish the boundaries of the
				1623	<computeroutput>PUSH</computeroutput>,
				1624	<computeroutput>POP</computeroutput> and
				1625	<computeroutput>CLEAR</computeroutput> sequences for the
				1626	call.</para>
				1627	</listitem>
				1628
				1629	<listitem>
				1630	<para><computeroutput>PUSH</computeroutput>,
				1631	<computeroutput>POP</computeroutput> and
				1632	<computeroutput>CLEAR</computeroutput> may only appear inside
				1633	sections bracketed by
				1634	<computeroutput>CALLM_S</computeroutput> and
				1635	<computeroutput>CALLM_E</computeroutput>, and nowhere else.</para>
				1636	</listitem>
				1637
				1638	<listitem>
<para>In any such bracketed section, no two
<computeroutput>PUSH</computeroutput> insns may push the same
<computeroutput>TempReg</computeroutput>. Dually, no two
<computeroutput>POP</computeroutput>s may pop the same
<computeroutput>TempReg</computeroutput>.</para>
				1644	</listitem>
				1645
				1646	<listitem>
				1647	<para>Finally, although this is not checked, args should be
				1648	removed from the stack with
				1649	<computeroutput>CLEAR</computeroutput>, rather than
				1650	<computeroutput>POP</computeroutput>s into a
				1651	<computeroutput>TempReg</computeroutput> which is not
				1652	subsequently used. This is because the instrumentation
				1653	mechanism assumes that all values
				1654	<computeroutput>POP</computeroutput>ped from the stack are
				1655	actually used.</para>
				1656	</listitem>
				1657
				1658	</itemizedlist>
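<para>The three checked restrictions in the list above amount to
a single forward scan over the uinstrs. The sketch below is not
<computeroutput>VG_(saneUCodeBlock)</computeroutput> itself: the
<computeroutput>UInstr</computeroutput> struct is a cut-down
stand-in, all names are hypothetical, and the ban on nested
<computeroutput>CALLM_S</computeroutput> sections is an extra
assumption of mine.</para>
<programlisting><![CDATA[
```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down uinstr: an opcode and one TempReg number. */
typedef enum {
    U_OTHER, U_CALLM_S, U_CALLM_E, U_PUSH, U_POP, U_CLEAR, U_CALL
} Opcode;

typedef struct { Opcode op; int temp; } UInstr;

enum { MAX_TEMPS = 64 };

/* Record that TempReg t was used; returns 0 if it is out of range or
   has already been seen in the current bracketed section. */
static int mark_once(unsigned char *seen, int t)
{
    if (t < 0 || t >= MAX_TEMPS || seen[t]) return 0;
    seen[t] = 1;
    return 1;
}

/* Returns 1 if the block obeys the checked restrictions, else 0. */
static int helper_calls_are_sane(const UInstr *u, size_t n)
{
    unsigned char pushed[MAX_TEMPS] = {0}, popped[MAX_TEMPS] = {0};
    int inside = 0;                  /* between CALLM_S and CALLM_E? */

    for (size_t i = 0; i < n; i++) {
        switch (u[i].op) {
        case U_CALLM_S:
            if (inside) return 0;    /* assumption: sections do not nest */
            inside = 1;
            memset(pushed, 0, sizeof pushed);
            memset(popped, 0, sizeof popped);
            break;
        case U_CALLM_E:
            if (!inside) return 0;
            inside = 0;
            break;
        case U_CALL:
        case U_CLEAR:
            if (!inside) return 0;   /* only legal inside a section */
            break;
        case U_PUSH:
            if (!inside || !mark_once(pushed, u[i].temp)) return 0;
            break;
        case U_POP:
            if (!inside || !mark_once(popped, u[i].temp)) return 0;
            break;
        default:
            break;
        }
    }
    return !inside;                  /* an open section must be closed */
}
```
]]></programlisting>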
				1659
				1660	<para>Some of the translations may appear to have redundant
				1661	<computeroutput>TempReg</computeroutput>-to-<computeroutput>TempReg</computeroutput>
				1662	moves. This helps the next phase, UCode optimisation, to
				1663	generate better code.</para>
				1664
				1665	</sect2>
				1666
				1667
				1668
				1669	<sect2 id="mc-tech-docs.optim" xreflabel="UCode optimisation">
				1670	<title>UCode optimisation</title>
				1671
				1672	<para>UCode is then subjected to an improvement pass
				1673	(<computeroutput>vg_improve()</computeroutput>), which blurs the
				1674	boundaries between the translations of the original x86
				1675	instructions. It's pretty straightforward. Three
				1676	transformations are done:</para>
				1677
				1678	<itemizedlist>
				1679
				1680	<listitem>
				1681	<para>Redundant <computeroutput>GET</computeroutput>
				1682	elimination. Actually, more general than that -- eliminates
				1683	redundant fetches of ArchRegs. In our running example,
				1684	uinstr 3 <computeroutput>GET</computeroutput>s
				1685	<computeroutput>%EDX</computeroutput> into
				1686	<computeroutput>t2</computeroutput> despite the fact that, by
				1687	looking at the previous uinstr, it is already in
				1688	<computeroutput>t0</computeroutput>. The
				1689	<computeroutput>GET</computeroutput> is therefore removed,
				1690	and <computeroutput>t2</computeroutput> renamed to
				1691	<computeroutput>t0</computeroutput>. Assuming
				1692	<computeroutput>t0</computeroutput> is allocated to a host
				1693	register, it means the simulated
				1694	<computeroutput>%EDX</computeroutput> will exist in a host
				1695	CPU register for more than one simulated x86 instruction,
				1696	which seems to me to be a highly desirable property.</para>
				1697
<para>There is some mucking around to do with subregisters:
<computeroutput>%AL</computeroutput> vs
<computeroutput>%AH</computeroutput> vs
<computeroutput>%AX</computeroutput> vs
<computeroutput>%EAX</computeroutput> etc. I can't remember
how it works, but in general we are very conservative, and
these tend to invalidate the caching.</para>
				1705	</listitem>
				1706
				1707	<listitem>
				1708	<para>Redundant <computeroutput>PUT</computeroutput>
				1709	elimination. This annuls
				1710	<computeroutput>PUT</computeroutput>s of values back to
				1711	simulated CPU registers if a later
				1712	<computeroutput>PUT</computeroutput> would overwrite the
earlier <computeroutput>PUT</computeroutput> value, and there
are no intervening reads of the simulated register
(<computeroutput>ArchReg</computeroutput>).</para>
				1716
<para>As before, we are paranoid when faced with subregister
references. Also, <computeroutput>PUT</computeroutput>s of
<computeroutput>%ESP</computeroutput> are never annulled,
because it is vital the instrumenter always has an up-to-date
<computeroutput>%ESP</computeroutput> value available:
<computeroutput>%ESP</computeroutput> changes affect
addressability of the memory around the simulated stack
pointer.</para>
				1725
<para>The implication of the above paragraph is that the
simulated machine's registers are only lazily updated once
the above two optimisation phases have run, with the
exception of <computeroutput>%ESP</computeroutput>.
<computeroutput>TempReg</computeroutput>s go dead at the end
of every basic block, from which it can be inferred that any
<computeroutput>TempReg</computeroutput> caching a simulated
CPU reg is flushed (back into the relevant
<computeroutput>VG_(baseBlock)</computeroutput> slot) at the
end of every basic block. The further implication is that
the simulated registers are only up-to-date in between
basic blocks, and not at arbitrary points inside basic
blocks. And the consequence of that is that we can only
deliver signals to the client in between basic blocks. None
of this seems to be any problem in practice.</para>
				1741	</listitem>
				1742
				1743	<listitem>
<para>Finally there is a simple def-use thing for condition
codes. If an earlier uinstr writes the condition codes, and
the next uinstr along which actually cares about the condition
codes writes the same or a larger set of them, but does not
read any, the earlier uinstr is marked as not writing any
condition codes. This saves a lot of redundant cond-code
saving and restoring.</para>
				1751	</listitem>
				1752
				1753	</itemizedlist>
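<para>The first transformation in the list above, redundant
<computeroutput>GET</computeroutput> elimination, can be sketched
as one forward pass that remembers which
<computeroutput>TempReg</computeroutput> currently mirrors each
<computeroutput>ArchReg</computeroutput>. This is a toy model and
not <computeroutput>vg_improve()</computeroutput>: uinstrs are
reduced to three hypothetical kinds, each naming a single
<computeroutput>TempReg</computeroutput>, and subregisters are
ignored entirely.</para>
<programlisting><![CDATA[
```c
#include <stddef.h>

enum { N_ARCH = 8, N_TEMP = 32 };

/* Toy opcodes: U_WRITE stands for any uinstr that changes its TempReg. */
typedef enum { U_GET, U_PUT, U_WRITE } Opcode;

typedef struct {
    Opcode op;
    int arch;   /* ArchReg index (GET, PUT); unused for U_WRITE   */
    int tmp;    /* TempReg written (GET, U_WRITE) or read (PUT)   */
    int dead;   /* set to 1 when the pass deletes this uinstr     */
} UInstr;

/* TempReg t is about to change, so it no longer mirrors any ArchReg. */
static void forget_temp(int *holder, int t)
{
    for (int a = 0; a < N_ARCH; a++)
        if (holder[a] == t) holder[a] = -1;
}

/* Returns the number of GETs deleted.  TempReg numbers must be < N_TEMP
   and ArchReg numbers < N_ARCH. */
static int eliminate_redundant_gets(UInstr *u, size_t n)
{
    int holder[N_ARCH];   /* holder[a] = TempReg known to equal ArchReg a */
    int alias[N_TEMP];    /* alias[t]  = TempReg that now stands in for t */
    int removed = 0;

    for (int a = 0; a < N_ARCH; a++) holder[a] = -1;
    for (int t = 0; t < N_TEMP; t++) alias[t] = t;

    for (size_t i = 0; i < n; i++) {
        int a = u[i].arch;
        if (u[i].op == U_GET) {
            int t = u[i].tmp;
            if (holder[a] >= 0) {         /* value is already in a TempReg */
                alias[t] = holder[a];     /* later uses of t use it instead */
                u[i].dead = 1;
                removed++;
            } else {
                alias[t] = t;
                forget_temp(holder, t);
                holder[a] = t;
            }
            continue;
        }
        u[i].tmp = alias[u[i].tmp];       /* apply earlier renamings */
        if (u[i].op == U_PUT)
            holder[a] = u[i].tmp;         /* TempReg now mirrors the ArchReg */
        else
            forget_temp(holder, u[i].tmp);
    }
    return removed;
}
```
]]></programlisting>
<para>Fed the first nine uinstrs of the running example (with
<computeroutput>INCL</computeroutput>,
<computeroutput>LDB</computeroutput> and
<computeroutput>WIDENL_Bs</computeroutput> modelled as
<computeroutput>U_WRITE</computeroutput>), this sketch deletes the
<computeroutput>GET</computeroutput>s at 3 and 7 and renames
<computeroutput>t2</computeroutput> and
<computeroutput>t6</computeroutput> to
<computeroutput>t0</computeroutput>.</para>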
				1754
				1755	<para>The effect of these transformations on our short block is
				1756	rather unexciting, and shown below. On longer basic blocks they
				1757	can dramatically improve code quality.</para>
				1758
				1759	<programlisting><![CDATA[
				1760	at 3: delete GET, rename t2 to t0 in (4 .. 6)
				1761	at 7: delete GET, rename t6 to t0 in (8 .. 9)
				1762	at 1: annul flag write OSZAP due to later OSZACP
				1763
				1764	Improved code:
				1765	0: GETL %EDX, t0
				1766	1: INCL t0
				1767	2: PUTL t0, %EDX
				1768	4: LDB (t0), t0
				1769	5: WIDENL_Bs t0
				1770	6: PUTL t0, %EAX
				1771	8: GETL %ECX, t8
				1772	9: LEA2L 1(t8,t0,2), t4
				1773	10: LDB (t4), t10
				1774	11: MOVB $0x20, t12
				1775	12: ANDB t12, t10 (-wOSZACP)
				1776	13: INCEIPo $9
				1777	14: Jnzo $0x40435A50 (-rOSZACP)
				1778	15: JMPo $0x40435A5B]]></programlisting>
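<para>The flag annulment reported for uinstr 1 can be reproduced
with a backward scan that remembers the next uinstr which touches
the condition codes. The following is only a sketch of that
def-use rule: the <computeroutput>FlagUse</computeroutput> struct,
the bit assignment of the flags and the function name are all
invented here.</para>
<programlisting><![CDATA[
```c
#include <stddef.h>

/* One bit per condition code; the bit assignment is arbitrary. */
enum { F_O = 1, F_S = 2, F_Z = 4, F_A = 8, F_C = 16, F_P = 32 };

typedef struct {
    unsigned reads;    /* flags this uinstr reads  */
    unsigned writes;   /* flags this uinstr writes */
} FlagUse;

/* Walk the block backwards.  A flag write is annulled when the next
   uinstr that cares about flags reads none and overwrites at least the
   same set.  Returns the number of uinstrs annulled. */
static int annul_dead_flag_writes(FlagUse *u, size_t n)
{
    const FlagUse *next = NULL;   /* next later uinstr touching flags */
    int annulled = 0;

    for (size_t i = n; i-- > 0; ) {
        if (u[i].writes != 0 && next != NULL
            && next->reads == 0
            && (u[i].writes & ~next->writes) == 0) {
            u[i].writes = 0;      /* every flag written is overwritten unread */
            annulled++;
        }
        if (u[i].reads != 0 || u[i].writes != 0)
            next = &u[i];         /* this uinstr still cares about flags */
    }
    return annulled;
}
```
]]></programlisting>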
				1779
				1780	</sect2>
				1781
				1782
				1783
				1784	<sect2 id="mc-tech-docs.instrum" xreflabel="UCode instrumentation">
				1785	<title>UCode instrumentation</title>
				1786
				1787	<para>Once you understand the meaning of the instrumentation
				1788	uinstrs, discussed in detail above, the instrumentation scheme is
				1789	fairly straightforward. Each uinstr is instrumented in
				1790	isolation, and the instrumentation uinstrs are placed before the
original uinstr. Our running example continues below. I have
placed a blank line after every original uinstr, to make it easier
to see which instrumentation uinstrs correspond to which
originals.</para>
				1795
				1796	<para>As mentioned somewhere above,
				1797	<computeroutput>TempReg</computeroutput>s carrying values have
				1798	names like <computeroutput>t28</computeroutput>, and each one has
				1799	a shadow carrying its V bits, with names like
				1800	<computeroutput>q28</computeroutput>. This pairing aids in
				1801	reading instrumented ucode.</para>
				1802
				1803	<para>One decision about all this is where to have "observation
				1804	points", that is, where to check that V bits are valid. I use a
				1805	minimalistic scheme, only checking where a failure of validity
				1806	could cause the original program to (seg)fault. So the use of
				1807	values as memory addresses causes a check, as do conditional
				1808	jumps (these cause a check on the definedness of the condition
codes). And arguments <computeroutput>PUSH</computeroutput>ed
for helper calls are checked, hence the weird restrictions on
helper call preambles described above.</para>
				1812
				1813	<para>Another decision is that once a value is tested, it is
				1814	thereafter regarded as defined, so that we do not emit multiple
				1815	undefined-value errors for the same undefined value. That means
				1816	that <computeroutput>TESTV</computeroutput> uinstrs are always
				1817	followed by <computeroutput>SETV</computeroutput> on the same
				1818	(shadow) <computeroutput>TempReg</computeroutput>s. Most of
				1819	these <computeroutput>SETV</computeroutput>s are redundant and
				1820	are removed by the post-instrumentation cleanup phase.</para>
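<para>The effect of the
<computeroutput>TESTV</computeroutput> /
<computeroutput>SETV</computeroutput> pairing can be seen in a
tiny C model. The counter
<computeroutput>value_errors</computeroutput> is a stand-in for
Valgrind's error reporting, and the function names are made up
for this sketch.</para>
<programlisting><![CDATA[
```c
#include <stdint.h>

static int value_errors;   /* stand-in for emitting an error report */

/* V bits relevant at an operation size (0 means a single V bit). */
static uint32_t size_mask(int size)
{
    switch (size) {
    case 0:  return 0x1u;
    case 1:  return 0xFFu;
    case 2:  return 0xFFFFu;
    default: return 0xFFFFFFFFu;
    }
}

/* TESTV: complain if any V bit at this size says "undefined" (1). */
static void testv(uint32_t q, int size)
{
    if (q & size_mask(size)) value_errors++;
}

/* SETV: return q with the V bits at this size marked defined (0). */
static uint32_t setv(uint32_t q, int size)
{
    return q & ~size_mask(size);
}
```
]]></programlisting>
<para>Testing the same shadow register a second time after the
<computeroutput>SETV</computeroutput> reports nothing, which is
exactly the "complain once" behaviour described above.</para>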
				1821
				1822	<para>The instrumentation for calling helper functions deserves
				1823	further comment. The definedness of results from a helper is
				1824	modelled using just one V bit. So, in short, we do pessimising
				1825	casts of the definedness of all the args, down to a single bit,
				1826	and then <computeroutput>UifU</computeroutput> these bits
				1827	together. So this single V bit will say "undefined" if any part
				1828	of any arg is undefined. This V bit is then pessimally cast back
up to the result(s) sizes, as needed. If Valgrind sees that the
result of the call is not actually used, because all the
args are got rid of with <computeroutput>CLEAR</computeroutput>
and none with <computeroutput>POP</computeroutput>, it immediately
examines the result V bit with a
<computeroutput>TESTV</computeroutput> /
<computeroutput>SETV</computeroutput> pair. If it did not do
this, there would be no observation point to detect that some
of the args to the helper were undefined. Of course, if the
				1838	helper's results are indeed used, we don't do this, since the
				1839	result usage will presumably cause the result definedness to be
				1840	checked at some suitable future point.</para>
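<para>In C terms, the single-bit approximation for helper calls
looks roughly like this. The function
<computeroutput>helper_result_vbits</computeroutput> is
hypothetical; it merely combines the pessimising casts and
<computeroutput>UifU</computeroutput> in the order the paragraph
above describes.</para>
<programlisting><![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper.  arg_vbits[i] holds the V bits of argument i
   (1 = undefined).  Returns the V bits for a result of the given size
   (0 = one bit, otherwise 1, 2 or 4 bytes). */
static uint32_t helper_result_vbits(const uint32_t *arg_vbits, size_t nargs,
                                    int result_size)
{
    uint32_t any_undefined = 0;

    /* Pessimising cast of each arg down to one bit, then UifU them. */
    for (size_t i = 0; i < nargs; i++)
        any_undefined |= (arg_vbits[i] != 0) ? 1u : 0u;

    if (!any_undefined) return 0;

    /* Pessimising cast of that one bit back up to the result size. */
    switch (result_size) {
    case 0:  return 0x1u;
    case 1:  return 0xFFu;
    case 2:  return 0xFFFFu;
    default: return 0xFFFFFFFFu;
    }
}
```
]]></programlisting>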
				1841
				1842	<para>In general Valgrind tries to track definedness on a
				1843	bit-for-bit basis, but as the above para shows, for calls to
				1844	helpers we throw in the towel and approximate down to a single
bit. This is because it's too complex and difficult to track
bit-level definedness through complex ops such as integer
multiply and divide, and in any case there are no reasonable code
fragments which attempt to (eg) multiply two partially-defined
values and end up with something meaningful, so there seems
little point in modelling multiplies, divides, etc, at that level
of detail.</para>
				1852
				1853	<para>Integer loads and stores are instrumented with firstly a
				1854	test of the definedness of the address, followed by a
				1855	<computeroutput>LOADV</computeroutput> or
				1856	<computeroutput>STOREV</computeroutput> respectively. These turn
				1857	into calls to (for example)
				1858	<computeroutput>VG_(helperc_LOADV4)</computeroutput>. These
				1859	helpers do two things: they perform an address-valid check, and
				1860	they load or store V bits from/to the relevant address in the
				1861	(simulated V-bit) memory.</para>
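<para>The two jobs of such a helper can be sketched over a toy
shadow memory. Everything here is an assumption made for
illustration: the flat <computeroutput>a_bits</computeroutput>
and <computeroutput>v_bits</computeroutput> arrays stand in for
whatever structure Valgrind really uses, the function names are
invented, and returning "defined" after an address error is my
choice for the sketch, not a documented behaviour.</para>
<programlisting><![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

enum { MEM_SIZE = 64 };              /* toy address space */

static uint8_t a_bits[MEM_SIZE];     /* per byte: 1 = invalid address    */
static uint8_t v_bits[MEM_SIZE];     /* per byte: 8 V bits, 1 = undefined */
static int address_errors;           /* stand-in for error reporting     */

/* Returns 1 if all len bytes at addr are addressable; otherwise records
   one address error and returns 0. */
static int check_address(size_t addr, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (addr + i >= MEM_SIZE || a_bits[addr + i]) {
            address_errors++;
            return 0;
        }
    }
    return 1;
}

/* 4-byte shadow load: address check, then fetch 32 V bits
   (little-endian, like the x86 data they shadow). */
static uint32_t loadv4(size_t addr)
{
    uint32_t v = 0;
    if (!check_address(addr, 4)) return 0;   /* sketch: treat as defined */
    for (size_t i = 0; i < 4; i++)
        v |= (uint32_t)v_bits[addr + i] << (8 * i);
    return v;
}

/* 4-byte shadow store: address check, then write 32 V bits. */
static void storev4(size_t addr, uint32_t v)
{
    if (!check_address(addr, 4)) return;
    for (size_t i = 0; i < 4; i++)
        v_bits[addr + i] = (uint8_t)(v >> (8 * i));
}
```
]]></programlisting>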
				1862
				1863	<para>FPU loads and stores are different. As above the
				1864	definedness of the address is first tested. However, the helper
				1865	routine for FPU loads
				1866	(<computeroutput>VGM_(fpu_read_check)</computeroutput>) emits an
				1867	error if either the address is invalid or the referenced area
				1868	contains undefined values. It has to do this because we do not
				1869	simulate the FPU at all, and so cannot track definedness of
				1870	values loaded into it from memory, so we have to check them as
				1871	soon as they are loaded into the FPU, ie, at this point. We
				1872	notionally assume that everything in the FPU is defined.</para>
				1873
				1874	<para>It follows therefore that FPU writes first check the
				1875	definedness of the address, then the validity of the address, and
				1876	finally mark the written bytes as well-defined.</para>
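<para>For comparison, here is a sketch of the FPU variants in the
same toy shadow-memory model. The names, the flat arrays and the
error counters are all hypothetical. The point is only that an FPU
read checks both addressability and definedness on the spot,
while an FPU write checks addressability and then marks the bytes
defined.</para>
<programlisting><![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

enum { MEM_SIZE = 64 };              /* toy address space */

static uint8_t a_bits[MEM_SIZE];     /* per byte: 1 = invalid address    */
static uint8_t v_bits[MEM_SIZE];     /* per byte: 8 V bits, 1 = undefined */
static int address_errors;           /* stand-ins for error reporting    */
static int value_errors;

static int addressable(size_t addr, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (addr + i >= MEM_SIZE || a_bits[addr + i]) return 0;
    return 1;
}

/* FPU load: complain if the area is unaddressable or holds any
   undefined byte, since definedness cannot be tracked inside the FPU. */
static void fpu_read_check(size_t addr, size_t len)
{
    if (!addressable(addr, len)) { address_errors++; return; }
    for (size_t i = 0; i < len; i++)
        if (v_bits[addr + i]) { value_errors++; return; }
}

/* FPU store: check the address, then mark the written bytes defined. */
static void fpu_write_check(size_t addr, size_t len)
{
    if (!addressable(addr, len)) { address_errors++; return; }
    for (size_t i = 0; i < len; i++)
        v_bits[addr + i] = 0;
}
```
]]></programlisting>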
				1877
<para>If anyone is inspired to extend Valgrind to MMX/SSE insns,
I suggest you use the same trick. It works provided that the
FPU/MMX unit is not used merely as a conduit to copy partially
undefined data from one place in memory to another.
Unfortunately the integer CPU is used like that (when copying C
structs with holes, for example) and this is the cause of much of
the elaborateness of the instrumentation here described.</para>
				1885
				1886	<para><computeroutput>vg_instrument()</computeroutput> in
				1887	<filename>vg_translate.c</filename> actually does the
				1888	instrumentation. There are comments explaining how each uinstr
				1889	is handled, so we do not repeat that here. As explained already,
				1890	it is bit-accurate, except for calls to helper functions.
				1891	Unfortunately the x86 insns
				1892	<computeroutput>bt/bts/btc/btr</computeroutput> are done by
helper fns, so bit-level accuracy is lost there. This should be
fixed by doing them inline; it will probably require adding a
couple of new uinstrs. Also, left and right rotates through the
carry flag (x86 <computeroutput>rcl</computeroutput> and
<computeroutput>rcr</computeroutput>) are approximated via a
single V bit; so far this has not caused anyone to complain. The
non-carry rotates, <computeroutput>rol</computeroutput> and
<computeroutput>ror</computeroutput>, are much more common and
are done exactly. On revisiting the instrumentation for AND and
OR, it seems rather verbose, and I wonder if it could be done
more concisely now.</para>
				1904
				1905	<para>The lowercase <computeroutput>o</computeroutput> on many of
				1906	the uopcodes in the running example indicates that the size field
				1907	is zero, usually meaning a single-bit operation.</para>
				1908
				1909	<para>Anyroads, the post-instrumented version of our running
				1910	example looks like this:</para>
				1911
				1912	<programlisting><![CDATA[
				1913	Instrumented code:
				1914	0: GETVL %EDX, q0
				1915	1: GETL %EDX, t0
				1916
				1917	2: TAG1o q0 = Left4 ( q0 )
				1918	3: INCL t0
				1919
				1920	4: PUTVL q0, %EDX
				1921	5: PUTL t0, %EDX
				1922
				1923	6: TESTVL q0
				1924	7: SETVL q0
				1925	8: LOADVB (t0), q0
				1926	9: LDB (t0), t0
				1927
				1928	10: TAG1o q0 = SWiden14 ( q0 )
				1929	11: WIDENL_Bs t0
				1930
				1931	12: PUTVL q0, %EAX
				1932	13: PUTL t0, %EAX
				1933
				1934	14: GETVL %ECX, q8
				1935	15: GETL %ECX, t8
				1936
				1937	16: MOVL q0, q4
				1938	17: SHLL $0x1, q4
				1939	18: TAG2o q4 = UifU4 ( q8, q4 )
				1940	19: TAG1o q4 = Left4 ( q4 )
				1941	20: LEA2L 1(t8,t0,2), t4
				1942
				1943	21: TESTVL q4
				1944	22: SETVL q4
				1945	23: LOADVB (t4), q10
				1946	24: LDB (t4), t10
				1947
				1948	25: SETVB q12
				1949	26: MOVB $0x20, t12
				1950
				1951	27: MOVL q10, q14
				1952	28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 )
				1953	29: TAG2o q10 = UifU1 ( q12, q10 )
				1954	30: TAG2o q10 = DifD1 ( q14, q10 )
				1955	31: MOVL q12, q14
				1956	32: TAG2o q14 = ImproveAND1_TQ ( t12, q14 )
				1957	33: TAG2o q10 = DifD1 ( q14, q10 )
				1958	34: MOVL q10, q16
				1959	35: TAG1o q16 = PCast10 ( q16 )
				1960	36: PUTVFo q16
				1961	37: ANDB t12, t10 (-wOSZACP)
				1962
				1963	38: INCEIPo $9
				1964
				1965	39: GETVFo q18
				1966	40: TESTVo q18
				1967	41: SETVo q18
				1968	42: Jnzo $0x40435A50 (-rOSZACP)
				1969
				1970	43: JMPo $0x40435A5B]]></programlisting>
				1971
				1972	</sect2>
				1973
				1974
				1975
				1976	<sect2 id="mc-tech-docs.cleanup"
				1977	xreflabel="UCode post-instrumentation cleanup">
				1978	<title>UCode post-instrumentation cleanup</title>
				1979
				1980	<para>This pass, coordinated by
				1981	<computeroutput>vg_cleanup()</computeroutput>, removes redundant
				1982	definedness computation created by the simplistic instrumentation
				1983	pass. It consists of two passes,
				1984	<computeroutput>vg_propagate_definedness()</computeroutput>
				1985	followed by
				1986	<computeroutput>vg_delete_redundant_SETVs</computeroutput>.</para>
				1987
				1988	<para><computeroutput>vg_propagate_definedness()</computeroutput>
				1989	is a simple constant-propagation and constant-folding pass. It
				1990	tries to determine which
				1991	<computeroutput>TempReg</computeroutput>s containing V bits will
				1992	always indicate "fully defined", and it propagates this
				1993	information as far as it can, and folds out as many operations as
				1994	possible. For example, the instrumentation for an ADD of a
				1995	literal to a variable quantity will be reduced down so that the
				1996	definedness of the result is simply the definedness of the
				1997	variable quantity, since the literal is by definition fully
				1998	defined.</para>
				1999
				2000	<para><computeroutput>vg_delete_redundant_SETVs</computeroutput>
				2001	removes <computeroutput>SETV</computeroutput>s on shadow
				2002	<computeroutput>TempReg</computeroutput>s for which the next
				2003	action is a write. I don't think there's anything else worth
				2004	saying about this; it is simple. Read the sources for
				2005	details.</para>
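<para>The SETV-deletion idea can be sketched in a few lines of C.
This is an illustrative model only, not the code in Valgrind: the
<computeroutput>Insn</computeroutput> type, the opcode names and
<computeroutput>delete_redundant_setvs()</computeroutput> are all
hypothetical, and the sketch additionally assumes that a shadow
<computeroutput>TempReg</computeroutput> with no later use in the
block is dead, so a <computeroutput>SETV</computeroutput> on it
can go too.</para>

<programlisting><![CDATA[
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, much-simplified uinstr: one opcode acting on one shadow temp. */
typedef enum { OP_NOP, OP_SETV, OP_READ, OP_WRITE } Op;
typedef struct { Op op; int temp; } Insn;

/* Turn into NOPs those SETVs whose shadow temp is next written (or never
   touched again) before being read.  Returns the number of SETVs deleted. */
size_t delete_redundant_setvs(Insn *code, size_t n)
{
    size_t deleted = 0;
    for (size_t i = 0; i < n; i++) {
        if (code[i].op != OP_SETV)
            continue;
        bool redundant = true;   /* assumption: unused until block end => dead */
        for (size_t j = i + 1; j < n; j++) {
            if (code[j].op == OP_NOP || code[j].temp != code[i].temp)
                continue;
            /* First later action on this temp: only a read keeps the SETV. */
            redundant = (code[j].op != OP_READ);
            break;
        }
        if (redundant) {
            code[i].op = OP_NOP;
            deleted++;
        }
    }
    return deleted;
}]]></programlisting>

<para>The scan is quadratic in the block length, which is
tolerable only because basic blocks are short.</para>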
				2006
				2007	<para>So the cleaned-up running example looks like this. As
				2008	above, I have inserted line breaks after every original
				2009	(non-instrumentation) uinstr to aid readability. As with
				2010	straightforward ucode optimisation, the results in this block are
				2011	undramatic because it is so short; longer blocks benefit more
				2012	because they have more redundancy which gets eliminated.</para>
				2013
				2014	<programlisting><![CDATA[
				2015	at 29: delete UifU1 due to defd arg1
				2016	at 32: change ImproveAND1_TQ to MOV due to defd arg2
				2017	at 41: delete SETV
				2018	at 31: delete MOV
				2019	at 25: delete SETV
				2020	at 22: delete SETV
				2021	at 7: delete SETV
				2022
				2023	0: GETVL %EDX, q0
				2024	1: GETL %EDX, t0
				2025
				2026	2: TAG1o q0 = Left4 ( q0 )
				2027	3: INCL t0
				2028
				2029	4: PUTVL q0, %EDX
				2030	5: PUTL t0, %EDX
				2031
				2032	6: TESTVL q0
				2033	8: LOADVB (t0), q0
				2034	9: LDB (t0), t0
				2035
				2036	10: TAG1o q0 = SWiden14 ( q0 )
				2037	11: WIDENL_Bs t0
				2038
				2039	12: PUTVL q0, %EAX
				2040	13: PUTL t0, %EAX
				2041
				2042	14: GETVL %ECX, q8
				2043	15: GETL %ECX, t8
				2044
				2045	16: MOVL q0, q4
				2046	17: SHLL $0x1, q4
				2047	18: TAG2o q4 = UifU4 ( q8, q4 )
				2048	19: TAG1o q4 = Left4 ( q4 )
				2049	20: LEA2L 1(t8,t0,2), t4
				2050
				2051	21: TESTVL q4
				2052	23: LOADVB (t4), q10
				2053	24: LDB (t4), t10
				2054
				2055	26: MOVB $0x20, t12
				2056
				2057	27: MOVL q10, q14
				2058	28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 )
				2059	30: TAG2o q10 = DifD1 ( q14, q10 )
				2060	32: MOVL t12, q14
				2061	33: TAG2o q10 = DifD1 ( q14, q10 )
				2062	34: MOVL q10, q16
				2063	35: TAG1o q16 = PCast10 ( q16 )
				2064	36: PUTVFo q16
				2065	37: ANDB t12, t10 (-wOSZACP)
				2066
				2067	38: INCEIPo $9
				2068	39: GETVFo q18
				2069	40: TESTVo q18
				2070	42: Jnzo $0x40435A50 (-rOSZACP)
				2071
				2072	43: JMPo $0x40435A5B]]></programlisting>
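<para>The two folds reported at the top of the listing (deleting
a <computeroutput>UifU1</computeroutput> and turning an
<computeroutput>ImproveAND1_TQ</computeroutput> into a
<computeroutput>MOV</computeroutput>) can be checked with
ordinary bitwise arithmetic. The following is a minimal sketch
rather than Valgrind code. It assumes that a V bit of 1 means
undefined and 0 means defined, that
<computeroutput>UifU</computeroutput> is a bitwise OR, and that
<computeroutput>ImproveAND_TQ(t, q)</computeroutput> is
<computeroutput>t OR q</computeroutput>; the C names are made up
for illustration.</para>

<programlisting><![CDATA[
#include <stdint.h>

/* Assumed encoding: a V bit of 1 means "undefined", 0 means "defined". */
typedef uint8_t VBits1;
#define VBITS1_DEFINED ((VBits1)0x00)

/* UifU: a result bit is undefined if either input bit is undefined. */
VBits1 uifu1(VBits1 qa, VBits1 qb)
{
    return (VBits1)(qa | qb);
}

/* ImproveAND_TQ: improvement term built from a value t and its V bits q. */
VBits1 improve_and1_tq(uint8_t t, VBits1 q)
{
    return (VBits1)(t | q);
}]]></programlisting>

<para>Under these assumptions
<computeroutput>uifu1(VBITS1_DEFINED, q)</computeroutput> is
just <computeroutput>q</computeroutput>, so the operation can be
deleted, and
<computeroutput>improve_and1_tq(t, VBITS1_DEFINED)</computeroutput>
is just <computeroutput>t</computeroutput>, so it degenerates
into a move of the value into the shadow register.</para>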
				2073
				2074	</sect2>
				2075
				2076
				2077
				2078	<sect2 id="mc-tech-docs.transfrom" xreflabel="Translation from UCode">
				2079	<title>Translation from UCode</title>
				2080
				2081	<para>This is all very simple, even though
				2082	<filename>vg_from_ucode.c</filename> is a big file.
				2083	Position-independent x86 code is generated into a dynamically
				2084	allocated array <computeroutput>emitted_code</computeroutput>;
				2085	this is doubled in size when it overflows. Eventually the array
				2086	is handed back to the caller of
				2087	<computeroutput>VG_(translate)</computeroutput>, who must copy
				2088	the result into TC and TT, and free the array.</para>
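<para>The doubling scheme is the usual growable-buffer idiom. A
minimal sketch follows; the <computeroutput>CodeBuf</computeroutput>
type, the function name and the initial size of 16 bytes are
assumptions made for the example, and it uses the C library's
<computeroutput>realloc</computeroutput> for brevity, which
Valgrind itself would not do.</para>

<programlisting><![CDATA[
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the emitted_code array and its bookkeeping. */
typedef struct {
    uint8_t *bytes;  /* heap-allocated code bytes, or NULL when empty */
    size_t   used;   /* number of bytes emitted so far */
    size_t   size;   /* current capacity in bytes */
} CodeBuf;

/* Append one byte, doubling the capacity on overflow.
   Returns 0 on success and -1 if memory could not be obtained. */
int codebuf_emit_byte(CodeBuf *cb, uint8_t b)
{
    if (cb->used == cb->size) {
        size_t new_size = cb->size ? cb->size * 2 : 16;
        uint8_t *grown = realloc(cb->bytes, new_size);
        if (grown == NULL)
            return -1;           /* old buffer is still valid */
        cb->bytes = grown;
        cb->size  = new_size;
    }
    cb->bytes[cb->used++] = b;
    return 0;
}]]></programlisting>

<para>Doubling keeps the amortised cost per emitted byte
constant, at the price of wasting up to half the buffer; since
the result is copied out and the array freed straight away, the
waste is short-lived.</para>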
				2089
				2090	<para>This file is structured into four layers of abstraction,
				2091	which, thankfully, are glued back together with extensive
				2092	<computeroutput>__inline__</computeroutput> directives. From the
				2093	bottom upwards:</para>
				2094
				2095	<itemizedlist>
				2096
				2097	<listitem>
				2098	<para>Address-mode emitters,
				2099	<computeroutput>emit_amode_regmem_reg</computeroutput> et
				2100	al.</para>
				2101	</listitem>
				2102
				2103	<listitem>
				2104	<para>Emitters for specific x86 instructions. There are
				2105	quite a lot of these, with names such as
				2106	<computeroutput>emit_movv_offregmem_reg</computeroutput>.
				2107	The <computeroutput>v</computeroutput> suffix is Intel
				2108	parlance for a 16/32 bit insn; there are also
				2109	<computeroutput>b</computeroutput> suffixes for 8 bit
				2110	insns.</para>
				2111	</listitem>
				2112
				2113	<listitem>
				2114	<para>The next level up are the
				2115	<computeroutput>synth_*</computeroutput> functions, which
				2116	synthesise possibly a sequence of raw x86 instructions to do
				2117	some simple task. Some of these are quite complex because
				2118	they have to work around Intel's silly restrictions on
				2119	subregister naming. See
				2120	<computeroutput>synth_nonshiftop_reg_reg</computeroutput> for
				2121	example.</para>
				2122	</listitem>
				2123
				2124	<listitem>
				2125	<para>Finally, at the top of the heap, we have
				2126	<computeroutput>emitUInstr()</computeroutput>, which emits
				2127	code for a single uinstr.</para>
				2128	</listitem>
				2129
				2130	</itemizedlist>
				2131
				2132	<para>Some comments:</para>
				2133
				2134	<itemizedlist>
				2135
				2136	<listitem>
				2137	<para>The hack for FPU instructions becomes apparent here.
To do an <computeroutput>FPU</computeroutput> ucode
instruction, we load the simulated FPU's state from
<computeroutput>VG_(baseBlock)</computeroutput> into the real
				2141	FPU using an x86 <computeroutput>frstor</computeroutput>
				2142	insn, do the ucode <computeroutput>FPU</computeroutput> insn
				2143	on the real CPU, and write the updated FPU state back into
				2144	<computeroutput>VG_(baseBlock)</computeroutput> using an
				2145	<computeroutput>fnsave</computeroutput> instruction. This is
				2146	pretty brutal, but is simple and it works, and even seems
				2147	tolerably efficient. There is no attempt to cache the
				2148	simulated FPU state in the real FPU over multiple
				2149	back-to-back ucode FPU instructions.</para>
				2150
				2151	<para><computeroutput>FPU_R</computeroutput> and
				2152	<computeroutput>FPU_W</computeroutput> are also done this
				2153	way, with the minor complication that we need to patch in
				2154	some addressing mode bits so the resulting insn knows the
				2155	effective address to use. This is easy because of the
				2156	regularity of the x86 FPU instruction encodings.</para>
				2157	</listitem>
				2158
				2159	<listitem>
				2160	<para>An analogous trick is done with ucode insns which
				2161	claim, in their <computeroutput>flags_r</computeroutput> and
				2162	<computeroutput>flags_w</computeroutput> fields, that they
				2163	read or write the simulated
				2164	<computeroutput>%EFLAGS</computeroutput>. For such cases we
				2165	first copy the simulated
				2166	<computeroutput>%EFLAGS</computeroutput> into the real
				2167	<computeroutput>%eflags</computeroutput>, then do the insn,
				2168	then, if the insn says it writes the flags, copy back to
				2169	<computeroutput>%EFLAGS</computeroutput>. This is a bit
				2170	expensive, which is why the ucode optimisation pass goes to
				2171	some effort to remove redundant flag-update annotations.</para>
				2172	</listitem>
				2173
				2174	</itemizedlist>
				2175
<para>And so ... that's the end of the documentation for the
instrumenting translator! It's really not that complex,
				2178	because it's composed as a sequence of simple(ish) self-contained
				2179	transformations on straight-line blocks of code.</para>
				2180
				2181	</sect2>
				2182
				2183
				2184
				2185	<sect2 id="mc-tech-docs.dispatch" xreflabel="Top-level dispatch loop">
				2186	<title>Top-level dispatch loop</title>
				2187
				2188	<para>Urk. In <computeroutput>VG_(toploop)</computeroutput>.
				2189	This is basically boring and unsurprising, not to mention fiddly
				2190	and fragile. It needs to be cleaned up.</para>
				2191
<para>Perhaps the only surprise is that the whole thing is run on
				2193	top of a <computeroutput>setjmp</computeroutput>-installed
				2194	exception handler, because, supposing a translation got a
				2195	segfault, we have to bail out of the Valgrind-supplied exception
				2196	handler <computeroutput>VG_(oursignalhandler)</computeroutput>
				2197	and immediately start running the client's segfault handler, if
				2198	it has one. In particular we can't finish the current basic
				2199	block and then deliver the signal at some convenient future
				2200	point, because signals like SIGILL, SIGSEGV and SIGBUS mean that
				2201	the faulting insn should not simply be re-tried. (I'm sure there
				2202	is a clearer way to explain this).</para>
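<para>The bail-out mechanism can be illustrated with a small
POSIX program fragment. This is a sketch under stated
assumptions, not the code in
<computeroutput>VG_(toploop)</computeroutput>: it uses
<computeroutput>sigsetjmp</computeroutput> and
<computeroutput>siglongjmp</computeroutput> rather than plain
<computeroutput>setjmp</computeroutput>, it simulates the fault
with <computeroutput>raise()</computeroutput>, and
<computeroutput>run_block_guarded()</computeroutput> and the
other names are hypothetical.</para>

<programlisting><![CDATA[
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf toploop_env;               /* where to bail out to */
static volatile sig_atomic_t caught_signo;   /* which signal forced the bail-out */

/* Stand-in for the tool's own handler: record the signal and abandon
   the current block instead of returning to the faulting instruction. */
static void our_signal_handler(int signo)
{
    caught_signo = signo;
    siglongjmp(toploop_env, 1);
}

/* Run one "translation" under the handler.  Returns 0 if the block ran
   to completion, the signal number if it was abandoned, or -1 on error. */
int run_block_guarded(void (*block)(void))
{
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = our_signal_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;

    caught_signo = 0;
    if (sigsetjmp(toploop_env, 1) == 0)      /* 1: save the signal mask too */
        block();

    sigaction(SIGSEGV, &old, NULL);          /* restore the previous handler */
    return (int)caught_signo;
}

/* Two toy blocks: one that faults (simulated) and one that does not. */
void faulting_block(void) { raise(SIGSEGV); }
void quiet_block(void)    { }]]></programlisting>

<para>At the point where
<computeroutput>run_block_guarded()</computeroutput> returns a
non-zero signal number, a real dispatcher would go on to start
the client's own handler for that signal.</para>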
				2203
				2204	</sect2>
				2205
				2206
				2207
				2208	<sect2 id="mc-tech-docs.lazy"
				2209	xreflabel="Lazy updates of the simulated program counter">
				2210	<title>Lazy updates of the simulated program counter</title>
				2211
				2212	<para>Simulated <computeroutput>%EIP</computeroutput> is not
				2213	updated after every simulated x86 insn as this was regarded as
				2214	too expensive. Instead ucode
				2215	<computeroutput>INCEIP</computeroutput> insns move it along as
				2216	and when necessary. Currently we don't allow it to fall more
				2217	than 4 bytes behind reality (see
				2218	<computeroutput>VG_(disBB)</computeroutput> for the way this
				2219	works).</para>
				2220
				2221	<para>Note that <computeroutput>%EIP</computeroutput> is always
				2222	brought up to date by the inner dispatch loop in
				2223	<computeroutput>VG_(dispatch)</computeroutput>, so that if the
				2224	client takes a fault we know at least which basic block this
				2225	happened in.</para>
				2226
				2227	</sect2>
				2228
				2229
				2230
				2231	<sect2 id="mc-tech-docs.signals" xreflabel="Signals">
				2232	<title>Signals</title>
				2233
				2234	<para>Horrible, horrible. <filename>vg_signals.c</filename>.
				2235	Basically, since we have to intercept all system calls anyway, we
				2236	can see when the client tries to install a signal handler. If it
				2237	does so, we make a note of what the client asked to happen, and
				2238	ask the kernel to route the signal to our own signal handler,
				2239	<computeroutput>VG_(oursignalhandler)</computeroutput>. This
				2240	simply notes the delivery of signals, and returns.</para>
				2241
				2242	<para>Every 1000 basic blocks, we see if more signals have
				2243	arrived. If so,
				2244	<computeroutput>VG_(deliver_signals)</computeroutput> builds
				2245	signal delivery frames on the client's stack, and allows their
				2246	handlers to be run. Valgrind places in these signal delivery
				2247	frames a bogus return address,
				2248	<computeroutput>VG_(signalreturn_bogusRA)</computeroutput>, and
checks all jumps to see if any jump to it. If one does, a signal
handler is returning, so Valgrind removes the relevant signal
frame from the client's stack, restores from that frame the
simulated state as it was before the signal was delivered, and
allows the client to run onwards. We have to do it this way
because some signal handlers never return; they just
<computeroutput>longjmp()</computeroutput>, which nukes the
signal delivery frame.</para>
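<para>The note-now, deliver-later arrangement can be modelled
without any real signals. The sketch below is a toy simulation
only: the <computeroutput>SignalSim</computeroutput> type, the
limit of 64 signal numbers and the function names are
assumptions, and "delivery" is reduced to counting.</para>

<programlisting><![CDATA[
enum { MAX_SIGNO = 64, POLL_INTERVAL = 1000 };

/* Hypothetical simulator state. */
typedef struct {
    unsigned char pending[MAX_SIGNO];  /* 1 if that signal has been noted */
    unsigned long blocks_run;          /* basic blocks executed so far */
    unsigned long delivered;           /* signals handed to the client */
} SignalSim;

/* What the tool's own handler does: note the signal and return. */
void note_signal(SignalSim *s, int signo)
{
    if (signo > 0 && signo < MAX_SIGNO)
        s->pending[signo] = 1;
}

/* Hand every noted signal to the client (here: just count it). */
static void deliver_signals(SignalSim *s)
{
    for (int i = 1; i < MAX_SIGNO; i++) {
        if (s->pending[i]) {
            s->pending[i] = 0;
            s->delivered++;
        }
    }
}

/* Run n basic blocks, polling for noted signals every POLL_INTERVAL blocks. */
void run_basic_blocks(SignalSim *s, unsigned long n)
{
    for (unsigned long i = 0; i < n; i++) {
        s->blocks_run++;
        if (s->blocks_run % POLL_INTERVAL == 0)
            deliver_signals(s);
    }
}]]></programlisting>

<para>A signal noted during block 1 is therefore not seen by the
client until block 1000 has finished, which is the latency this
scheme trades for simplicity.</para>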
				2257
				2258	<para>The Linux kernel has a different but equally horrible hack
				2259	for detecting signal handler returns. Discovering it is left as
				2260	an exercise for the reader.</para>
				2261
				2262	</sect2>
				2263
				2264
				2265	<sect2 id="mc-tech-docs.todo">
				2266	<title>To be written</title>
				2267
				2268	<para>The following is a list of as-yet-not-written stuff. Apologies.</para>
				2269	<orderedlist>
				2270	<listitem>
				2271	<para>The translation cache and translation table</para>
				2272	</listitem>
				2273	<listitem>
				2274	<para>Exceptions, creating new translations</para>
				2275	</listitem>
				2276	<listitem>
				2277	<para>Self-modifying code</para>
				2278	</listitem>
				2279	<listitem>
				2280	<para>Errors, error contexts, error reporting, suppressions</para>
				2281	</listitem>
				2282	<listitem>
				2283	<para>Client malloc/free</para>
				2284	</listitem>
				2285	<listitem>
				2286	<para>Low-level memory management</para>
				2287	</listitem>
				2288	<listitem>
				2289	<para>A and V bitmaps</para>
				2290	</listitem>
				2291	<listitem>
				2292	<para>Symbol table management</para>
				2293	</listitem>
				2294	<listitem>
				2295	<para>Dealing with system calls</para>
				2296	</listitem>
				2297	<listitem>
				2298	<para>Namespace management</para>
				2299	</listitem>
				2300	<listitem>
				2301	<para>GDB attaching</para>
				2302	</listitem>
				2303	<listitem>
				2304	<para>Non-dependence on glibc or anything else</para>
				2305	</listitem>
				2306	<listitem>
				2307	<para>The leak detector</para>
				2308	</listitem>
				2309	<listitem>
				2310	<para>Performance problems</para>
				2311	</listitem>
				2312	<listitem>
				2313	<para>Continuous sanity checking</para>
				2314	</listitem>
				2315	<listitem>
				2316	<para>Tracing, or not tracing, child processes</para>
				2317	</listitem>
				2318	<listitem>
				2319	<para>Assembly glue for syscalls</para>
				2320	</listitem>
				2321	</orderedlist>
				2322
				2323	</sect2>
				2324
				2325	</sect1>
				2326
				2327
				2328
				2329
				2330	<sect1 id="mc-tech-docs.extensions" xreflabel="Extensions">
				2331	<title>Extensions</title>
				2332
				2333	<para>Some comments about Stuff To Do.</para>
				2334
				2335	<sect2 id="mc-tech-docs.bugs" xreflabel="Bugs">
				2336	<title>Bugs</title>
				2337
				2338	<para>Stephan Kulow and Marc Mutz report problems with kmail in
				2339	KDE 3 CVS (RC2 ish) when run on Valgrind. Stephan has it
				2340	deadlocking; Marc has it looping at startup. I can't repro
				2341	either behaviour. Needs repro-ing and fixing.</para>
				2342
				2343	</sect2>
				2344
				2345
				2346	<sect2 id="mc-tech-docs.threads" xreflabel="Threads">
				2347	<title>Threads</title>
				2348
				2349	<para>Doing a good job of thread support strikes me as almost a
				2350	research-level problem. The central issues are how to do fast
				2351	cheap locking of the
				2352	<computeroutput>VG_(primary_map)</computeroutput> structure,
				2353	whether or not accesses to the individual secondary maps need
				2354	locking, what race-condition issues result, and whether the
				2355	already-nasty mess that is the signal simulator needs further
				2356	hackery.</para>
				2357
				2358	<para>I realise that threads are the most-frequently-requested
				2359	feature, and I am thinking about it all. If you have guru-level
				2360	understanding of fast mutual exclusion mechanisms and race
				2361	conditions, I would be interested in hearing from you.</para>
				2362
				2363	</sect2>
				2364
				2365
				2366
				2367	<sect2 id="mc-tech-docs.verify" xreflabel="Verification suite">
				2368	<title>Verification suite</title>
				2369
				2370	<para>Directory <computeroutput>tests/</computeroutput> contains
various ad-hoc tests for Valgrind. However, there is no
systematic verification or regression suite that, for example,
				2373	exercises all the stuff in <filename>vg_memory.c</filename>, to
				2374	ensure that illegal memory accesses and undefined value uses are
				2375	detected as they should be. It would be good to have such a
				2376	suite.</para>
				2377
				2378	</sect2>
				2379
				2380
				2381	<sect2 id="mc-tech-docs.porting" xreflabel="Porting to other platforms">
				2382	<title>Porting to other platforms</title>
				2383
<para>It would be great if Valgrind were ported to FreeBSD and x86
NetBSD, and to x86 OpenBSD, if that's possible (doesn't OpenBSD use
a.out-style executables, not ELF?).</para>
				2387
				2388	<para>The main difficulties, for an x86-ELF platform, seem to
				2389	be:</para>
				2390
				2391	<itemizedlist>
				2392
				2393	<listitem>
				2394	<para>You'd need to rewrite the
				2395	<computeroutput>/proc/self/maps</computeroutput> parser
				2396	(<filename>vg_procselfmaps.c</filename>). Easy.</para>
				2397	</listitem>
				2398
				2399	<listitem>
				2400	<para>You'd need to rewrite
				2401	<filename>vg_syscall_mem.c</filename>, or, more specifically,
				2402	provide one for your OS. This is tedious, but you can
				2403	implement syscalls on demand, and the Linux kernel interface
				2404	is, for the most part, going to look very similar to the *BSD
				2405	interfaces, so it's really a copy-paste-and-modify-on-demand
				2406	job. As part of this, you'd need to supply a new
				2407	<filename>vg_kerneliface.h</filename> file.</para>
				2408	</listitem>
				2409
				2410	<listitem>
				2411	<para>You'd also need to change the syscall wrappers for
				2412	Valgrind's internal use, in
				2413	<filename>vg_mylibc.c</filename>.</para>
				2414	</listitem>
				2415
				2416	</itemizedlist>
				2417
				2418	<para>All in all, I think a port to x86-ELF *BSDs is not really
				2419	very difficult, and in some ways I would like to see it happen,
because that would force a clearer factoring of Valgrind into
				2421	platform dependent and independent pieces. Not to mention, *BSD
				2422	folks also deserve to use Valgrind just as much as the Linux crew
				2423	do.</para>
				2424
				2425	</sect2>
				2426
				2427	</sect1>
				2428
				2429
				2430
				2431	<sect1 id="mc-tech-docs.easystuff"
				2432	xreflabel="Easy stuff which ought to be done">
				2433	<title>Easy stuff which ought to be done</title>
				2434
				2435
				2436	<sect2 id="mc-tech-docs.mmx" xreflabel="MMX Instructions">
				2437	<title>MMX Instructions</title>
				2438
				2439	<para>MMX insns should be supported, using the same trick as for
				2440	FPU insns. If the MMX registers are not used to copy
				2441	uninitialised junk from one place to another in memory, this
				2442	means we don't have to actually simulate the internal MMX unit
				2443	state, so the FPU hack applies. This should be fairly
				2444	easy.</para>
				2445
				2446	</sect2>
				2447
				2448
				2449	<sect2 id="mc-tech-docs.fixstabs" xreflabel="Fix stabs-info Reader">
				2450	<title>Fix stabs-info reader</title>
				2451
				2452	<para>The machinery in <filename>vg_symtab2.c</filename> which
				2453	reads "stabs" style debugging info is pretty weak. It usually
				2454	correctly translates simulated program counter values into line
				2455	numbers and procedure names, but the file name is often
				2456	completely wrong. I think the logic used to parse "stabs"
				2457	entries is weak. It should be fixed. The simplest solution,
				2458	IMO, is to copy either the logic or simply the code out of GNU
				2459	binutils which does this; since GDB can clearly get it right,
				2460	binutils (or GDB?) must have code to do this somewhere.</para>
				2461
				2462	</sect2>
				2463
				2464
				2465
				2466	<sect2 id="mc-tech-docs.x86instr" xreflabel="BT/BTC/BTS/BTR">
				2467	<title>BT/BTC/BTS/BTR</title>
				2468
				2469	<para>These are x86 instructions which test, complement, set, or
				2470	reset, a single bit in a word. At the moment they are both
				2471	incorrectly implemented and incorrectly instrumented.</para>
				2472
				2473	<para>The incorrect instrumentation is due to use of helper
				2474	functions. This means we lose bit-level definedness tracking,
				2475	which could wind up giving spurious uninitialised-value use
				2476	errors. The Right Thing to do is to invent a couple of new
				2477	UOpcodes, I think <computeroutput>GET_BIT</computeroutput> and
				2478	<computeroutput>SET_BIT</computeroutput>, which can be used to
				2479	implement all 4 x86 insns, get rid of the helpers, and give
				2480	bit-accurate instrumentation rules for the two new
				2481	UOpcodes.</para>
				2482
				2483	<para>I realised the other day that they are mis-implemented too.
				2484	The x86 insns take a bit-index and a register or memory location
				2485	to access. For registers the bit index clearly can only be in
				2486	the range zero to register-width minus 1, and I assumed the same
				2487	applied to memory locations too. But evidently not; for memory
				2488	locations the index can be arbitrary, and the processor will
				2489	index arbitrarily into memory as a result. This too should be
				2490	fixed. Sigh. Presumably indexing outside the immediate word is
				2491	not actually used by any programs yet tested on Valgrind, for
				2492	otherwise they (presumably) would simply not work at all. If you
				2493	plan to hack on this, first check the Intel docs to make sure my
				2494	understanding is really correct.</para>
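<para>To make the memory-operand case concrete, here is a
byte-level model of a bit test with an arbitrary signed bit
index. It is a sketch of my reading of the behaviour described
above, so the same caveat applies: check it against the Intel
documentation. The function name
<computeroutput>bt_mem()</computeroutput> is hypothetical.</para>

<programlisting><![CDATA[
#include <stdint.h>

/* Test bit number bit_index relative to base, where the index may be
   negative or exceed the width of a word.  Bit 0 is the least
   significant bit of base[0]; bit -1 is the top bit of base[-1]. */
int bt_mem(const uint8_t *base, int32_t bit_index)
{
    int32_t byte_off = bit_index / 8;   /* C division truncates toward zero */
    int32_t bit      = bit_index % 8;
    if (bit < 0) {                      /* convert to floor division */
        bit      += 8;
        byte_off -= 1;
    }
    return (base[byte_off] >> bit) & 1;
}]]></programlisting>

<para>An implementation that masks the index down to the width
of one word reads the wrong byte whenever
<computeroutput>byte_off</computeroutput> falls outside that
word, which is the mis-implementation being described.</para>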
				2495
				2496	</sect2>
				2497
				2498
				2499	<sect2 id="mc-tech-docs.prefetch" xreflabel="Using PREFETCH Instructions">
				2500	<title>Using PREFETCH Instructions</title>
				2501
				2502	<para>Here's a small but potentially interesting project for
				2503	performance junkies. Experiments with valgrind's code generator
				2504	and optimiser(s) suggest that reducing the number of instructions
				2505	executed in the translations and mem-check helpers gives
				2506	disappointingly small performance improvements. Perhaps this is
				2507	because performance of Valgrindified code is limited by cache
				2508	misses. After all, each read in the original program now gives
				2509	rise to at least three reads, one for the
				2510	<computeroutput>VG_(primary_map)</computeroutput>, one of the
				2511	resulting secondary, and the original. Not to mention, the
				2512	instrumented translations are 13 to 14 times larger than the
				2513	originals. All in all one would expect the memory system to be
				2514	hammered to hell and then some.</para>
				2515
				2516	<para>So here's an idea. An x86 insn involving a read from
				2517	memory, after instrumentation, will turn into ucode of the
				2518	following form:</para>
				2519	<programlisting><![CDATA[
				2520	... calculate effective addr, into ta and qa ...
				2521	TESTVL qa -- is the addr defined?
				2522	LOADV (ta), qloaded -- fetch V bits for the addr
				2523	LOAD (ta), tloaded -- do the original load]]></programlisting>
				2524
				2525	<para>At the point where the
				2526	<computeroutput>LOADV</computeroutput> is done, we know the
				2527	actual address (<computeroutput>ta</computeroutput>) from which
				2528	the real <computeroutput>LOAD</computeroutput> will be done. We
				2529	also know that the <computeroutput>LOADV</computeroutput> will
				2530	take around 20 x86 insns to do. So it seems plausible that doing
				2531	a prefetch of <computeroutput>ta</computeroutput> just before the
				2532	<computeroutput>LOADV</computeroutput> might just avoid a miss at
				2533	the <computeroutput>LOAD</computeroutput> point, and that might
				2534	be a significant performance win.</para>
				2535
<para>Prefetch insns are notoriously temperamental, more often
				2537	than not making things worse rather than better, so this would
				2538	require considerable fiddling around. It's complicated because
				2539	Intels and AMDs have different prefetch insns with different
				2540	semantics, so that too needs to be taken into account. As a
				2541	general rule, even placing the prefetches before the
				2542	<computeroutput>LOADV</computeroutput> insn is too near the
				2543	<computeroutput>LOAD</computeroutput>; the ideal distance is
				2544	apparently circa 200 CPU cycles. So it might be worth having
				2545	another analysis/transformation pass which pushes prefetches as
				2546	far back as possible, hopefully immediately after the effective
				2547	address becomes available.</para>
				2548
				2549	<para>Doing too many prefetches is also bad because they soak up
bus bandwidth / CPU resources, so some cleverness in deciding
				2551	which loads to prefetch and which to not might be helpful. One
				2552	can imagine not prefetching client-stack-relative
				2553	(<computeroutput>%EBP</computeroutput> or
				2554	<computeroutput>%ESP</computeroutput>) accesses, since the stack
				2555	in general tends to show good locality anyway.</para>
				2556
				2557	<para>There's quite a lot of experimentation to do here, but I
				2558	think it might make an interesting week's work for
				2559	someone.</para>
				2560
				2561	<para>As of 15-ish March 2002, I've started to experiment with
				2562	this, using the AMD
				2563	<computeroutput>prefetch/prefetchw</computeroutput> insns.</para>
				2564
				2565	</sect2>
				2566
				2567
				2568	<sect2 id="mc-tech-docs.pranges" xreflabel="User-defined Permission Ranges">
				2569	<title>User-defined Permission Ranges</title>
				2570
				2571	<para>This is quite a large project -- perhaps a month's hacking
				2572	for a capable hacker to do a good job -- but it's potentially
				2573	very interesting. The outcome would be that Valgrind could
				2574	detect a whole class of bugs which it currently cannot.</para>
				2575
				2576	<para>The presentation falls into two pieces.</para>
				2577
				2578	<sect3 id="mc-tech-docs.psetting"
				2579	xreflabel="Part 1: User-defined Address-range Permission Setting">
				2580	<title>Part 1: User-defined Address-range Permission Setting</title>
				2581
				2582	<para>Valgrind intercepts the client's
				2583	<computeroutput>malloc</computeroutput>,
				2584	<computeroutput>free</computeroutput>, etc calls, watches system
				2585	calls, and watches the stack pointer move. This is currently the
				2586	only way it knows about which addresses are valid and which not.
				2587	Sometimes the client program knows extra information about its
				2588	memory areas. For example, the client could at some point know
				2589	that all elements of an array are out-of-date. We would like to
				2590	be able to convey to Valgrind this information that the array is
				2591	now addressable-but-uninitialised, so that Valgrind can then warn
				2592	if elements are used before they get new values.</para>
				2593
				2594	<para>What I would like are some macros like this:</para>
				2595	<programlisting><![CDATA[
				2596	VALGRIND_MAKE_NOACCESS(addr, len)
				2597	VALGRIND_MAKE_WRITABLE(addr, len)
				2598	VALGRIND_MAKE_READABLE(addr, len)]]></programlisting>
				2599
				2600	<para>and also, to check that memory is
addressable/initialised,</para>
				2602	<programlisting><![CDATA[
				2603	VALGRIND_CHECK_ADDRESSIBLE(addr, len)
				2604	VALGRIND_CHECK_INITIALISED(addr, len)]]></programlisting>
				2605
				2606	<para>I then include in my sources a header defining these
				2607	macros, rebuild my app, run under Valgrind, and get user-defined
				2608	checks.</para>
				2609
				2610	<para>Now here's a neat trick. It's a nuisance to have to
				2611	re-link the app with some new library which implements the above
				2612	macros. So the idea is to define the macros so that the
				2613	resulting executable is still completely stand-alone, and can be
				2614	run without Valgrind, in which case the macros do nothing, but
				2615	when run on Valgrind, the Right Thing happens. How to do this?
				2616	The idea is for these macros to turn into a piece of inline
				2617	assembly code, which (1) has no effect when run on the real CPU,
				2618	(2) is easily spotted by Valgrind's JITter, and (3) no sane
				2619	person would ever write, which is important for avoiding false
				2620	matches in (2). So here's a suggestion:</para>
				2621	<programlisting><![CDATA[
				2622	VALGRIND_MAKE_NOACCESS(addr, len)]]></programlisting>
				2623
				2624	<para>becomes (roughly speaking)</para>
				2625	<programlisting><![CDATA[
				2626	movl addr, %eax
				2627	movl len, %ebx
				2628	movl $1, %ecx -- 1 describes the action; MAKE_WRITABLE might be
				2629	-- 2, etc
				2630	rorl $13, %ecx
				2631	rorl $19, %ecx
				2632	rorl $11, %eax
				2633	rorl $21, %eax]]></programlisting>
				2634
				2635	<para>The rotate sequences have no effect, and it's unlikely they
				2636	would appear for any other reason, but they define a unique
				2637	byte-sequence which the JITter can easily spot. Using the
				2638	operand constraints section at the end of a gcc inline-assembly
				2639	statement, we can tell gcc that the assembly fragment kills
				2640	<computeroutput>%eax</computeroutput>,
				2641	<computeroutput>%ebx</computeroutput>,
				2642	<computeroutput>%ecx</computeroutput> and the condition codes, so
				2643	this fragment is made harmless when not running on Valgrind, runs
				2644	quickly when not on Valgrind, and does not require any other
				2645	library support.</para>
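<para>That the rotate pairs really are no-ops is easy to confirm
in portable C: each pair rotates by a total of 32 bits. The
helper names below are made up for the example, and the sketch
only models the arithmetic; it does not emit the marker
byte-sequence.</para>

<programlisting><![CDATA[
#include <stdint.h>

/* Rotate a 32-bit value right by n bits (n is taken modulo 32). */
uint32_t ror32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x >> n) | (x << ((32u - n) & 31u));
}

/* The two rotate pairs from the suggested marker sequence. */
uint32_t marker_rotates_ecx(uint32_t ecx) { return ror32(ror32(ecx, 13), 19); }
uint32_t marker_rotates_eax(uint32_t eax) { return ror32(ror32(eax, 11), 21); }]]></programlisting>

<para>Both functions return their argument unchanged for every
32-bit input, since 13 + 19 and 11 + 21 both equal 32.</para>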
				2646
				2647
				2648	</sect3>
				2649
				2650
				2651	<sect3 id="mc-tech-docs.prange-detect"
				2652	xreflabel="Part 2: Using it to detect Interference between Stack
				2653	Variables">
				2654	<title>Part 2: Using it to detect Interference between Stack
				2655	Variables</title>
				2656
				2657	<para>Currently Valgrind cannot detect errors of the following
				2658	form:</para>
				2659	<programlisting><![CDATA[
				2660	void fooble ( void )
				2661	{
				2662	int a[10];
				2663	int b[10];
				2664	a[10] = 99;
				2665	}]]></programlisting>
				2666
				2667	<para>Now imagine rewriting this as</para>
				2668	<programlisting><![CDATA[
				2669	void fooble ( void )
				2670	{
				2671	int spacer0;
				2672	int a[10];
				2673	int spacer1;
				2674	int b[10];
				2675	int spacer2;
				2676	VALGRIND_MAKE_NOACCESS(&spacer0, sizeof(int));
				2677	VALGRIND_MAKE_NOACCESS(&spacer1, sizeof(int));
				2678	VALGRIND_MAKE_NOACCESS(&spacer2, sizeof(int));
				2679	a[10] = 99;
				2680	}]]></programlisting>
				2681
				2682	<para>Now the invalid write is certain to hit
				2683	<computeroutput>spacer0</computeroutput> or
				2684	<computeroutput>spacer1</computeroutput>, so Valgrind will spot
				2685	the error.</para>
				2686
				2687	<para>There are two complications.</para>
				2688
				2689	<orderedlist>
				2690
				2691	<listitem>
				2692	<para>The first is that we don't want to annotate sources by
				2693	hand, so the Right Thing to do is to write a C/C++ parser,
				2694	annotator, prettyprinter which does this automatically, and
				2695	run it on post-CPP'd C/C++ source. See
				2696	http://www.cacheprof.org for an example of a system which
				2697	transparently inserts another phase into the gcc/g++
				2698	compilation route. The parser/prettyprinter is probably not
				2699	as hard as it sounds; I would write it in Haskell, a powerful
				2700	functional language well suited to doing symbolic
computation, with which I am intimately familiar. There is
				2702	already a C parser written in Haskell by someone in the
				2703	Haskell community, and that would probably be a good starting
				2704	point.</para>
				2705	</listitem>
				2706
				2707
				2708	<listitem>
				2709	<para>The second complication is how to get rid of these
				2710	<computeroutput>NOACCESS</computeroutput> records inside
				2711	Valgrind when the instrumented function exits; after all,
				2712	these refer to stack addresses and will make no sense
				2713	whatever when some other function happens to re-use the same
				2714	stack address range, probably shortly afterwards. I think I
				2715	would be inclined to define a special stack-specific
				2716	macro:</para>
				2717	<programlisting><![CDATA[
				2718	VALGRIND_MAKE_NOACCESS_STACK(addr, len)]]></programlisting>
				2719	<para>which causes Valgrind to record the client's
				2720	<computeroutput>%ESP</computeroutput> at the time it is
				2721	executed. Valgrind will then watch for changes in
				2722	<computeroutput>%ESP</computeroutput> and discard such
				2723	records as soon as the protected area is uncovered by an
				2724	increase in <computeroutput>%ESP</computeroutput>. I
				2725	hesitate with this scheme only because it is potentially
				2726	expensive, if there are hundreds of such records, and
				2727	considering that changes in
				2728	<computeroutput>%ESP</computeroutput> already require
				2729	expensive messing with stack access permissions.</para>
				2730	</listitem>
				2731	</orderedlist>
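<para>The discard step in the second complication might look
roughly like this. It is only a sketch: the
<computeroutput>NoAccessRec</computeroutput> type and
<computeroutput>discard_uncovered()</computeroutput> are
hypothetical, the stack is assumed to grow downwards, and a
record is treated as uncovered as soon as any part of its area
lies below the new stack pointer.</para>

<programlisting><![CDATA[
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one stack area marked as no-access. */
typedef struct {
    uintptr_t addr;  /* lowest address of the protected area */
    size_t    len;   /* size of the area in bytes */
} NoAccessRec;

/* Called when the stack pointer rises to new_esp.  Drops every record
   whose area starts below new_esp and compacts the survivors to the
   front of the array, preserving order.  Returns the number kept. */
size_t discard_uncovered(NoAccessRec *recs, size_t n, uintptr_t new_esp)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (recs[i].addr >= new_esp)
            recs[kept++] = recs[i];
    }
    return kept;
}]]></programlisting>

<para>The linear scan on every stack-pointer increase is where
the cost concern comes from; keeping the records sorted by
address would let the scan stop at the first survivor.</para>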
				2732
				2733	<para>This is probably easier and more robust than for the
				2734	instrumenter program to try and spot all exit points for the
				2735	procedure and place suitable deallocation annotations there.
				2736	Plus C++ procedures can bomb out at any point if they get an
				2737	exception, so spotting return points at the source level just
				2738	won't work at all.</para>
				2739
<para>Although it takes some work, it's all eminently doable, and it would
				2741	make Valgrind into an even-more-useful tool.</para>
				2742
				2743	</sect3>
				2744
				2745	</sect2>
				2746
				2747	</sect1>
				2748	</chapter>