sewardj	a9a2dcf	2002-11-11 00:20:07 +0000	[diff] [blame]	1	<html>
				2	<head>
				3	<style type="text/css">
				4	body { background-color: #ffffff;
				5	color: #000000;
				6	font-family: Times, Helvetica, Arial;
				7	font-size: 14pt}
				8	h4 { margin-bottom: 0.3em}
				9	code { color: #000000;
				10	font-family: Courier;
				11	font-size: 13pt }
				12	pre { color: #000000;
				13	font-family: Courier;
				14	font-size: 13pt }
				15	a:link { color: #0000C0;
				16	text-decoration: none; }
				17	a:visited { color: #0000C0;
				18	text-decoration: none; }
				19	a:active { color: #0000C0;
				20	text-decoration: none; }
				21	</style>
				22	<title>The design and implementation of Valgrind</title>
				23	</head>
				24
				25	<body bgcolor="#ffffff">
				26
sewardj	f555ac7	2002-11-18 00:07:28 +0000	[diff] [blame]	27	<a name="mc-techdocs"> </a>
sewardj	a9a2dcf	2002-11-11 00:20:07 +0000	[diff] [blame]	28	<h1 align=center>The design and implementation of Valgrind</h1>
				29
				30	<center>
				31	Detailed technical notes for hackers, maintainers and the
				32	overly-curious<br>
				33	These notes pertain to snapshot 20020306<br>
				34	<p>
njn	3e87f7e	2003-04-08 11:08:45 +0000	[diff] [blame]	35	<a href="mailto:jseward@acm.org">jseward@acm.org</a><br>
nethercote	421281e	2003-11-20 16:20:55 +0000	[diff] [blame^]	36	<a href="http://valgrind.kde.org">http://valgrind.kde.org</a><br>
njn	0e1b514	2003-04-15 14:58:06 +0000	[diff] [blame]	37	Copyright © 2000-2003 Julian Seward
sewardj	a9a2dcf	2002-11-11 00:20:07 +0000	[diff] [blame]	38	<p>
				39	Valgrind is licensed under the GNU General Public License,
				40	version 2<br>
				41	An open-source tool for finding memory-management problems in
				42	x86 GNU/Linux executables.
				43	</center>
				44
				45	<p>
				46
				47
				48
				49
				50	<hr width="100%">
				51
				52	<h2>Introduction</h2>
				53
				54	This document contains a detailed, highly-technical description of the
				55	internals of Valgrind. This is not the user manual; if you are an
				56	end-user of Valgrind, you do not want to read this. Conversely, if
				57	you really are a hacker-type and want to know how it works, I assume
				58	that you have read the user manual thoroughly.
				59	<p>
				60	You may need to read this document several times, and carefully. Some
				61	important things, I only say once.
				62
				63
				64	<h3>History</h3>
				65
				66	Valgrind came into public view in late Feb 2002. However, it has been
				67	under contemplation for a very long time, perhaps seriously for about
				68	five years. Somewhat over two years ago, I started working on the x86
				69	code generator for the Glasgow Haskell Compiler
				70	(http://www.haskell.org/ghc), gaining familiarity with x86 internals
				71	on the way. I then did Cacheprof (http://www.cacheprof.org), gaining
				72	further x86 experience. Some time around Feb 2000 I started
				73	experimenting with a user-space x86 interpreter for x86-Linux. This
				74	worked, but it was clear that a JIT-based scheme would be necessary to
				75	give reasonable performance for Valgrind. Design work for the JITter
				76	started in earnest in Oct 2000, and by early 2001 I had an x86-to-x86
				77	dynamic translator which could run quite large programs. This
				78	translator was in a sense pointless, since it did not do any
				79	instrumentation or checking.
				80
				81	<p>
				82	Most of the rest of 2001 was taken up designing and implementing the
				83	instrumentation scheme. The main difficulty, which consumed a lot
				84	of effort, was to design a scheme which did not generate large numbers
				85	of false uninitialised-value warnings. By late 2001 a satisfactory
				86	scheme had been arrived at, and I started to test it on ever-larger
				87	programs, with an eventual eye to making it work well enough so that
				88	it was helpful to folks debugging the upcoming version 3 of KDE. I've
used KDE since before version 1.0, and wanted Valgrind to be an
				90	indirect contribution to the KDE 3 development effort. At the start of
				91	Feb 02 the kde-core-devel crew started using it, and gave a huge
				92	amount of helpful feedback and patches in the space of three weeks.
				93	Snapshot 20020306 is the result.
				94
				95	<p>
				96	In the best Unix tradition, or perhaps in the spirit of Fred Brooks'
depressing-but-completely-accurate aphorism "build one to throw away;
				98	you will anyway", much of Valgrind is a second or third rendition of
				99	the initial idea. The instrumentation machinery
				100	(<code>vg_translate.c</code>, <code>vg_memory.c</code>) and core CPU
				101	simulation (<code>vg_to_ucode.c</code>, <code>vg_from_ucode.c</code>)
				102	have had three redesigns and rewrites; the register allocator,
				103	low-level memory manager (<code>vg_malloc2.c</code>) and symbol table
				104	reader (<code>vg_symtab2.c</code>) are on the second rewrite. In a
				105	sense, this document serves to record some of the knowledge gained as
				106	a result.
				107
				108
				109	<h3>Design overview</h3>
				110
				111	Valgrind is compiled into a Linux shared object,
				112	<code>valgrind.so</code>, and also a dummy one,
				113	<code>valgrinq.so</code>, of which more later. The
				114	<code>valgrind</code> shell script adds <code>valgrind.so</code> to
the <code>LD_PRELOAD</code> list of extra libraries to be
loaded with any dynamically linked executable. This is a standard trick,
				117	one which I assume the <code>LD_PRELOAD</code> mechanism was developed
				118	to support.
				119
				120	<p>
				121	<code>valgrind.so</code>
				122	is linked with the <code>-z initfirst</code> flag, which requests that
				123	its initialisation code is run before that of any other object in the
				124	executable image. When this happens, valgrind gains control. The
				125	real CPU becomes "trapped" in <code>valgrind.so</code> and the
				126	translations it generates. The synthetic CPU provided by Valgrind
				127	does, however, return from this initialisation function. So the
				128	normal startup actions, orchestrated by the dynamic linker
				129	<code>ld.so</code>, continue as usual, except on the synthetic CPU,
				130	not the real one. Eventually <code>main</code> is run and returns,
				131	and then the finalisation code of the shared objects is run,
presumably in the reverse of the order in which they were initialised. Remember,
				133	this is still all happening on the simulated CPU. Eventually
				134	<code>valgrind.so</code>'s own finalisation code is called. It spots
				135	this event, shuts down the simulated CPU, prints any error summaries
				136	and/or does leak detection, and returns from the initialisation code
				137	on the real CPU. At this point, in effect the real and synthetic CPUs
				138	have merged back into one, Valgrind has lost control of the program,
and the program finally <code>exit()</code>s back to the kernel in the
				140	usual way.
				141
				142	<p>
daywalker	667c98f	2003-09-23 19:07:16 +0000	[diff] [blame]	143	The normal course of activity, once Valgrind has started up, is as
sewardj	a9a2dcf	2002-11-11 00:20:07 +0000	[diff] [blame]	144	follows. Valgrind never runs any part of your program (usually
				145	referred to as the "client"), not a single byte of it, directly.
				146	Instead it uses function <code>VG_(translate)</code> to translate
				147	basic blocks (BBs, straight-line sequences of code) into instrumented
				148	translations, and those are run instead. The translations are stored
				149	in the translation cache (TC), <code>vg_tc</code>, with the
				150	translation table (TT), <code>vg_tt</code> supplying the
				151	original-to-translation code address mapping. Auxiliary array
				152	<code>VG_(tt_fast)</code> is used as a direct-map cache for fast
				153	lookups in TT; it usually achieves a hit rate of around 98% and
				154	facilitates an orig-to-trans lookup in 4 x86 insns, which is not bad.
				155
				156	<p>
				157	Function <code>VG_(dispatch)</code> in <code>vg_dispatch.S</code> is
				158	the heart of the JIT dispatcher. Once a translated code address has
				159	been found, it is executed simply by an x86 <code>call</code>
				160	to the translation. At the end of the translation, the next
				161	original code addr is loaded into <code>%eax</code>, and the
				162	translation then does a <code>ret</code>, taking it back to the
				163	dispatch loop, with, interestingly, zero branch mispredictions.
				164	The address requested in <code>%eax</code> is looked up first in
				165	<code>VG_(tt_fast)</code>, and, if not found, by calling C helper
				166	<code>VG_(search_transtab)</code>. If there is still no translation
				167	available, <code>VG_(dispatch)</code> exits back to the top-level
				168	C dispatcher <code>VG_(toploop)</code>, which arranges for
				169	<code>VG_(translate)</code> to make a new translation. All fairly
				170	unsurprising, really. There are various complexities described below.
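<p>
The two-stage lookup described above can be sketched in C. This is a
minimal illustration, not the Valgrind code: the table sizes, the
<code>Entry</code> type and the function names are all hypothetical, and
the slow path is a linear search only to keep the sketch short. It shows
a direct-mapped cache consulted first, with a fallback to the full table
that refills the cache slot on success.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes and names; the real tables in Valgrind differ. */
#define FAST_SIZE 1024u   /* must be a power of two */
#define SLOW_SIZE 8u

typedef struct { uint32_t orig; void *trans; } Entry;

static Entry  fast_cache[FAST_SIZE];  /* direct-mapped: one slot per index */
static Entry  slow_table[SLOW_SIZE];  /* stand-in for the full table (TT)  */
static size_t slow_used = 0;

/* Record a translation in the slow table (no eviction in this sketch). */
static void add_translation(uint32_t orig, void *trans)
{
    assert(slow_used < SLOW_SIZE);
    assert(trans != NULL);
    slow_table[slow_used].orig  = orig;
    slow_table[slow_used].trans = trans;
    slow_used++;
}

/* Full search; linear here purely for brevity. */
static void *search_slow(uint32_t orig)
{
    for (size_t i = 0; i < slow_used; i++)
        if (slow_table[i].orig == orig)
            return slow_table[i].trans;
    return NULL;
}

/* Look up orig. Returns NULL when a new translation would have to be
   made. *fast_hit reports whether the direct-mapped cache answered. */
static void *find_translation(uint32_t orig, int *fast_hit)
{
    Entry *e = &fast_cache[orig & (FAST_SIZE - 1u)];
    if (e->trans != NULL && e->orig == orig) {
        *fast_hit = 1;
        return e->trans;
    }
    *fast_hit = 0;
    void *t = search_slow(orig);
    if (t != NULL) {   /* refill the slot, displacing any older entry */
        e->orig  = orig;
        e->trans = t;
    }
    return t;
}
```

Two original addresses that share their low index bits compete for one
slot, which is the price of the cheap index computation; a high hit rate
depends on such collisions being rare in practice.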
				171
				172	<p>
				173	The translator, orchestrated by <code>VG_(translate)</code>, is
				174	complicated but entirely self-contained. It is described in great
				175	detail in subsequent sections. Translations are stored in TC, with TT
				176	tracking administrative information. The translations are subject to
				177	an approximate LRU-based management scheme. With the current
				178	settings, the TC can hold at most about 15MB of translations, and LRU
				179	passes prune it to about 13.5MB. Given that the
				180	orig-to-translation expansion ratio is about 13:1 to 14:1, this means
				181	TC holds translations for more or less a megabyte of original code,
				182	which generally comes to about 70000 basic blocks for C++ compiled
				183	with optimisation on. Generating new translations is expensive, so it
				184	is worth having a large TC to minimise the (capacity) miss rate.
				185
				186	<p>
				187	The dispatcher, <code>VG_(dispatch)</code>, receives hints from
				188	the translations which allow it to cheaply spot all control
				189	transfers corresponding to x86 <code>call</code> and <code>ret</code>
				190	instructions. It has to do this in order to spot some special events:
				191	<ul>
				192	<li>Calls to <code>VG_(shutdown)</code>. This is Valgrind's cue to
				193	exit. NOTE: actually this is done a different way; it should be
				194	cleaned up.
				195	<p>
<li>Returns of signal handlers, to the return address
				197	<code>VG_(signalreturn_bogusRA)</code>. The signal simulator
				198	needs to know when a signal handler is returning, so we spot
				199	jumps (returns) to this address.
				200	<p>
				201	<li>Calls to <code>vg_trap_here</code>. All <code>malloc</code>,
				202	<code>free</code>, etc calls that the client program makes are
				203	eventually routed to a call to <code>vg_trap_here</code>,
				204	and Valgrind does its own special thing with these calls.
				205	In effect this provides a trapdoor, by which Valgrind can
				206	intercept certain calls on the simulated CPU, run the call as it
				207	sees fit itself (on the real CPU), and return the result to
				208	the simulated CPU, quite transparently to the client program.
				209	</ul>
				210	Valgrind intercepts the client's <code>malloc</code>,
				211	<code>free</code>, etc,
				212	calls, so that it can store additional information. Each block
				213	<code>malloc</code>'d by the client gives rise to a shadow block
				214	in which Valgrind stores the call stack at the time of the
				215	<code>malloc</code>
				216	call. When the client calls <code>free</code>, Valgrind tries to
				217	find the shadow block corresponding to the address passed to
				218	<code>free</code>, and emits an error message if none can be found.
				219	If it is found, the block is placed on the freed blocks queue
				220	<code>vg_freed_list</code>, it is marked as inaccessible, and
				221	its shadow block now records the call stack at the time of the
				222	<code>free</code> call. Keeping <code>free</code>'d blocks in
				223	this queue allows Valgrind to spot all (presumably invalid) accesses
				224	to them. However, once the volume of blocks in the free queue
				225	exceeds <code>VG_(clo_freelist_vol)</code>, blocks are finally
				226	removed from the queue.
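<p>
The freed-blocks queue with its volume cap can be sketched as a FIFO
that releases its oldest entries once the total quarantined size exceeds
a limit. The types and names below are hypothetical, and the sketch
tracks only block sizes, not the shadow call stacks or the
accessibility marking described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct FreedBlock {
    struct FreedBlock *next;
    size_t size;               /* size of the client block, in bytes */
} FreedBlock;

typedef struct {
    FreedBlock *oldest, *newest;
    size_t volume;             /* total bytes currently held in the queue */
    size_t limit;              /* plays the role of the volume cap        */
    size_t released;           /* how many blocks have finally been let go */
} FreedQueue;

static void queue_init(FreedQueue *q, size_t limit)
{
    q->oldest = q->newest = NULL;
    q->volume = 0;
    q->limit = limit;
    q->released = 0;
}

/* Append a freshly freed block, then trim from the old end until the
   queue is back under its volume limit. */
static void queue_add(FreedQueue *q, size_t size)
{
    FreedBlock *b = malloc(sizeof *b);
    assert(b != NULL);
    b->next = NULL;
    b->size = size;
    if (q->newest != NULL)
        q->newest->next = b;
    else
        q->oldest = b;
    q->newest = b;
    q->volume += size;

    while (q->volume > q->limit && q->oldest != NULL) {
        FreedBlock *old = q->oldest;
        q->oldest = old->next;
        if (q->oldest == NULL)
            q->newest = NULL;
        q->volume -= old->size;
        q->released++;
        free(old);             /* only now may the memory be reused */
    }
}
```

A larger limit keeps freed blocks inaccessible for longer, so stale
accesses are more likely to be caught, at the cost of holding more
memory out of circulation.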
				227
				228	<p>
				229	Keeping track of A and V bits (note: if you don't know what these are,
				230	you haven't read the user guide carefully enough) for memory is done
				231	in <code>vg_memory.c</code>. This implements a sparse array structure
				232	which covers the entire 4G address space in a way which is reasonably
fast and reasonably space efficient. The 4G address space is divided
up into 64K sections, each covering 64KB of address space. Given a
32-bit address, the top 16 bits are used to select one of the 65536
entries in <code>VG_(primary_map)</code>. The resulting "secondary"
(<code>SecMap</code>) holds A and V bits for that 64KB chunk of
address space, and the lower 16 bits of the address index into it.
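<p>
A minimal sketch of such a two-level map, assuming one A bit per byte
and ignoring V bits entirely. The type <code>SecMapSketch</code>, the
function names, the lazy allocation policy and the convention that a set
bit means "accessible" are assumptions made for illustration; they are
not taken from <code>vg_memory.c</code>.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One secondary covers 64KB of address space, one A bit per byte. */
typedef struct { uint8_t abits[65536 / 8]; } SecMapSketch;

/* 65536 primary entries; NULL means the chunk was never touched. */
static SecMapSketch *primary[65536];

/* Top 16 bits pick the secondary, allocating it on first use. */
static SecMapSketch *get_secondary(uint32_t addr)
{
    uint32_t hi = addr >> 16;
    if (primary[hi] == NULL) {
        primary[hi] = calloc(1, sizeof(SecMapSketch));
        assert(primary[hi] != NULL);
    }
    return primary[hi];
}

/* Low 16 bits pick the bit within the secondary. */
static void set_abit(uint32_t addr, int accessible)
{
    SecMapSketch *sm = get_secondary(addr);
    uint32_t lo   = addr & 0xFFFFu;
    uint8_t  mask = (uint8_t)(1u << (lo & 7u));
    if (accessible)
        sm->abits[lo >> 3] |= mask;
    else
        sm->abits[lo >> 3] &= (uint8_t)~mask;
}

static int get_abit(uint32_t addr)
{
    const SecMapSketch *sm = primary[addr >> 16];
    if (sm == NULL)
        return 0;              /* untouched chunk: treated as inaccessible */
    uint32_t lo = addr & 0xFFFFu;
    return (sm->abits[lo >> 3] >> (lo & 7u)) & 1;
}
```

The space saving comes from the primary level: a client that touches
only a few 64KB chunks pays for only a few secondaries, while a lookup
remains two indexing steps.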
				239
				240
				241	<h3>Design decisions</h3>
				242
				243	Some design decisions were motivated by the need to make Valgrind
				244	debuggable. Imagine you are writing a CPU simulator. It works fairly
				245	well. However, you run some large program, like Netscape, and after
				246	tens of millions of instructions, it crashes. How can you figure out
				247	where in your simulator the bug is?
				248
				249	<p>
				250	Valgrind's answer is: cheat. Valgrind is designed so that it is
				251	possible to switch back to running the client program on the real
				252	CPU at any point. Using the <code>--stop-after= </code> flag, you can
				253	ask Valgrind to run just some number of basic blocks, and then
				254	run the rest of the way on the real CPU. If you are searching for
				255	a bug in the simulated CPU, you can use this to do a binary search,
				256	which quickly leads you to the specific basic block which is
				257	causing the problem.
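<p>
The binary search mentioned above can be sketched as follows. The
predicate stands for "run Valgrind with this basic block count and see
whether the client misbehaves"; here it is a hypothetical stand-in
function, and the sketch assumes the failure is monotonic (once the
faulty block has been simulated, the run always fails).

```c
#include <assert.h>

/* Hypothetical predicate: nonzero if the client misbehaves when the
   first n basic blocks are run on the simulated CPU. */
typedef int (*fails_fn)(unsigned long n);

/* Stand-in for a real experiment: pretend block 123457 is the culprit. */
static int demo_fails(unsigned long n)
{
    return n >= 123457UL;
}

/* Returns the smallest n in (good, bad] for which fails(n) holds,
   assuming fails(good) is false and fails(bad) is true. */
static unsigned long find_first_bad(fails_fn fails,
                                    unsigned long good, unsigned long bad)
{
    assert(good < bad);
    while (bad - good > 1) {
        unsigned long mid = good + (bad - good) / 2;
        if (fails(mid))
            bad = mid;     /* failure already present: look earlier */
        else
            good = mid;    /* still fine: look later */
    }
    return bad;
}
```

With ten million basic blocks to search, this needs about 24 runs rather
than millions, which is what makes the approach practical.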
				258
				259	<p>
				260	This is all very handy. It does constrain the design in certain
				261	unimportant ways. Firstly, the layout of memory, when viewed from the
				262	client's point of view, must be identical regardless of whether it is
				263	running on the real or simulated CPU. This means that Valgrind can't
				264	do pointer swizzling -- well, no great loss -- and it can't run on
				265	the same stack as the client -- again, no great loss.
				266	Valgrind operates on its own stack, <code>VG_(stack)</code>, which
				267	it switches to at startup, temporarily switching back to the client's
				268	stack when doing system calls for the client.
				269
				270	<p>
				271	Valgrind also receives signals on its own stack,
				272	<code>VG_(sigstack)</code>, but for different gruesome reasons
				273	discussed below.
				274
				275	<p>
				276	This nice clean switch-back-to-the-real-CPU-whenever-you-like story
				277	is muddied by signals. Problem is that signals arrive at arbitrary
				278	times and tend to slightly perturb the basic block count, with the
				279	result that you can get close to the basic block causing a problem but
				280	can't home in on it exactly. My kludgey hack is to define
				281	<code>SIGNAL_SIMULATION</code> to 1 towards the bottom of
				282	<code>vg_syscall_mem.c</code>, so that signal handlers are run on the
				283	real CPU and don't change the BB counts.
				284
				285	<p>
				286	A second hole in the switch-back-to-real-CPU story is that Valgrind's
				287	way of delivering signals to the client is different from that of the
				288	kernel. Specifically, the layout of the signal delivery frame, and
				289	the mechanism used to detect a sighandler returning, are different.
				290	So you can't expect to make the transition inside a sighandler and
				291	still have things working, but in practice that's not much of a
				292	restriction.
				293
				294	<p>
				295	Valgrind's implementation of <code>malloc</code>, <code>free</code>,
				296	etc, (in <code>vg_clientmalloc.c</code>, not the low-level stuff in
				297	<code>vg_malloc2.c</code>) is somewhat complicated by the need to
handle switching back at arbitrary points. It does work, though.
				299
				300
				301
				302	<h3>Correctness</h3>
				303
				304	There's only one of me, and I have a Real Life (tm) as well as hacking
				305	Valgrind [allegedly :-]. That means I don't have time to waste
				306	chasing endless bugs in Valgrind. My emphasis is therefore on doing
				307	everything as simply as possible, with correctness, stability and
				308	robustness being the number one priority, more important than
				309	performance or functionality. As a result:
				310	<ul>
				311	<li>The code is absolutely loaded with assertions, and these are
				312	<b>permanently enabled.</b> I have no plan to remove or disable
				313	them later. Over the past couple of months, as valgrind has
				314	become more widely used, they have shown their worth, pulling
				315	up various bugs which would otherwise have appeared as
				316	hard-to-find segmentation faults.
				317	<p>
				318	I am of the view that it's acceptable to spend 5% of the total
				319	running time of your valgrindified program doing assertion checks
				320	and other internal sanity checks.
				321	<p>
				322	<li>Aside from the assertions, valgrind contains various sets of
				323	internal sanity checks, which get run at varying frequencies
				324	during normal operation. <code>VG_(do_sanity_checks)</code>
				325	runs every 1000 basic blocks, which means 500 to 2000 times/second
				326	for typical machines at present. It checks that Valgrind hasn't
				327	overrun its private stack, and does some simple checks on the
				328	memory permissions maps. Once every 25 calls it does some more
				329	extensive checks on those maps. Etc, etc.
				330	<p>
				331	The following components also have sanity check code, which can
				332	be enabled to aid debugging:
				333	<ul>
				334	<li>The low-level memory-manager
				335	(<code>VG_(mallocSanityCheckArena)</code>). This does a
				336	complete check of all blocks and chains in an arena, which
				337	is very slow. Is not engaged by default.
				338	<p>
				339	<li>The symbol table reader(s): various checks to ensure
				340	uniqueness of mappings; see <code>VG_(read_symbols)</code>
				341	for a start. Is permanently engaged.
				342	<p>
				343	<li>The A and V bit tracking stuff in <code>vg_memory.c</code>.
				344	This can be compiled with cpp symbol
				345	<code>VG_DEBUG_MEMORY</code> defined, which removes all the
				346	fast, optimised cases, and uses simple-but-slow fallbacks
				347	instead. Not engaged by default.
				348	<p>
				349	<li>Ditto <code>VG_DEBUG_LEAKCHECK</code>.
				350	<p>
				351	<li>The JITter parses x86 basic blocks into sequences of
				352	UCode instructions. It then sanity checks each one with
				353	<code>VG_(saneUInstr)</code> and sanity checks the sequence
				354	as a whole with <code>VG_(saneUCodeBlock)</code>. This stuff
				355	is engaged by default, and has caught some way-obscure bugs
				356	in the simulated CPU machinery in its time.
				357	<p>
				358	<li>The system call wrapper does
				359	<code>VG_(first_and_last_secondaries_look_plausible)</code> after
				360	every syscall; this is known to pick up bugs in the syscall
				361	wrappers. Engaged by default.
				362	<p>
				363	<li>The main dispatch loop, in <code>VG_(dispatch)</code>, checks
				364	that translations do not set <code>%ebp</code> to any value
				365	different from <code>VG_EBP_DISPATCH_CHECKED</code> or
<code>&amp; VG_(baseBlock)</code>. In effect this test is free,
sewardj	a9a2dcf	2002-11-11 00:20:07 +0000	[diff] [blame]	367	and is permanently engaged.
				368	<p>
				369	<li>There are a couple of ifdefed-out consistency checks I
inserted whilst debugging the new register allocator,
				371	<code>vg_do_register_allocation</code>.
				372	</ul>
				373	<p>
				374	<li>I try to avoid techniques, algorithms, mechanisms, etc, for which
				375	I can supply neither a convincing argument that they are correct,
				376	nor sanity-check code which might pick up bugs in my
				377	implementation. I don't always succeed in this, but I try.
				378	Basically the idea is: avoid techniques which are, in practice,
				379	unverifiable, in some sense. When doing anything, always have in
				380	mind: "how can I verify that this is correct?"
				381	</ul>
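<p>
The tiered checking schedule described above (cheap checks every 1000
basic blocks, more extensive ones on every 25th call) can be sketched
like this. The struct and function names are hypothetical, and the
checks themselves are reduced to counters so that only the scheduling is
shown.

```c
#include <assert.h>

#define BBS_PER_CHECK   1000u  /* cheap checks run every 1000 basic blocks */
#define CHECKS_PER_DEEP 25u    /* every 25th call also runs the deep checks */

typedef struct {
    unsigned long bbs_done;
    unsigned long cheap_runs;
    unsigned long deep_runs;
} SanityState;

/* Placeholders: real checks would inspect stacks and permission maps. */
static void cheap_checks(SanityState *s) { s->cheap_runs++; }
static void deep_checks(SanityState *s)  { s->deep_runs++; }

/* Called once per executed basic block. */
static void after_basic_block(SanityState *s)
{
    s->bbs_done++;
    if (s->bbs_done % BBS_PER_CHECK != 0)
        return;
    cheap_checks(s);
    if (s->cheap_runs % CHECKS_PER_DEEP == 0)
        deep_checks(s);
}
```

Running the expensive checks at a fixed fraction of the cheap ones
keeps the total overhead bounded while still exercising them regularly.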
				382
				383	<p>
				384	Some more specific things are:
				385
				386	<ul>
				387	<li>Valgrind runs in the same namespace as the client, at least from
				388	<code>ld.so</code>'s point of view, and it therefore absolutely
				389	had better not export any symbol with a name which could clash
				390	with that of the client or any of its libraries. Therefore, all
				391	globally visible symbols exported from <code>valgrind.so</code>
				392	are defined using the <code>VG_</code> CPP macro. As you'll see
from <code>vg_constants.h</code>, this prepends some arbitrary
				394	prefix to the symbol, in order that it be, we hope, globally
				395	unique. Currently the prefix is <code>vgPlain_</code>. For
				396	convenience there are also <code>VGM_</code>, <code>VGP_</code>
				397	and <code>VGOFF_</code>. All locally defined symbols are declared
				398	<code>static</code> and do not appear in the final shared object.
				399	<p>
				400	To check this, I periodically do
<code>nm valgrind.so | grep " T "</code>,
				402	which shows you all the globally exported text symbols.
				403	They should all have an approved prefix, except for those like
				404	<code>malloc</code>, <code>free</code>, etc, which we deliberately
				405	want to shadow and take precedence over the same names exported
				406	from <code>glibc.so</code>, so that valgrind can intercept those
calls easily. Similarly, <code>nm valgrind.so | grep " D "</code>
				408	allows you to find any rogue data-segment symbol names.
				409	<p>
				410	<li>Valgrind tries, and almost succeeds, in being completely
				411	independent of all other shared objects, in particular of
				412	<code>glibc.so</code>. For example, we have our own low-level
				413	memory manager in <code>vg_malloc2.c</code>, which is a fairly
				414	standard malloc/free scheme augmented with arenas, and
				415	<code>vg_mylibc.c</code> exports reimplementations of various bits
				416	and pieces you'd normally get from the C library.
				417	<p>
				418	Why all the hassle? Because imagine the potential chaos of both
				419	the simulated and real CPUs executing in <code>glibc.so</code>.
				420	It just seems simpler and cleaner to be completely self-contained,
				421	so that only the simulated CPU visits <code>glibc.so</code>. In
				422	practice it's not much hassle anyway. Also, valgrind starts up
				423	before glibc has a chance to initialise itself, and who knows what
				424	difficulties that could lead to. Finally, glibc has definitions
				425	for some types, specifically <code>sigset_t</code>, which conflict
with (are different from) the Linux kernel's idea of the same. When
				427	Valgrind wants to fiddle around with signal stuff, it wants to
				428	use the kernel's definitions, not glibc's definitions. So it's
				429	simplest just to keep glibc out of the picture entirely.
				430	<p>
				431	To find out which glibc symbols are used by Valgrind, reinstate
				432	the link flags <code>-nostdlib -Wl,-no-undefined</code>. This
				433	causes linking to fail, but will tell you what you depend on.
				434	I have mostly, but not entirely, got rid of the glibc
				435	dependencies; what remains is, IMO, fairly harmless. AFAIK the
				436	current dependencies are: <code>memset</code>,
				437	<code>memcmp</code>, <code>stat</code>, <code>system</code>,
				438	<code>sbrk</code>, <code>setjmp</code> and <code>longjmp</code>.
				439
				440	<p>
				441	<li>Similarly, valgrind should not really import any headers other
				442	than the Linux kernel headers, since it knows of no API other than
				443	the kernel interface to talk to. At the moment this is really not
in a good state, and <code>vg_syscall_mem.c</code> imports, via
				445	<code>vg_unsafe.h</code>, a significant number of C-library
				446	headers so as to know the sizes of various structs passed across
				447	the kernel boundary. This is of course completely bogus, since
				448	there is no guarantee that the C library's definitions of these
structs match those of the kernel. I have started to sort this
				450	out using <code>vg_kerneliface.h</code>, into which I had intended
				451	to copy all kernel definitions which valgrind could need, but this
				452	has not gotten very far. At the moment it mostly contains
				453	definitions for <code>sigset_t</code> and <code>struct
				454	sigaction</code>, since the kernel's definition for these really
				455	does clash with glibc's. I plan to use a <code>vki_</code> prefix
				456	on all these types and constants, to denote the fact that they
				457	pertain to <b>V</b>algrind's <b>K</b>ernel <b>I</b>nterface.
				458	<p>
				459	Another advantage of having a <code>vg_kerneliface.h</code> file
				460	is that it makes it simpler to interface to a different kernel.
One can, for example, easily imagine writing a new
				462	<code>vg_kerneliface.h</code> for FreeBSD, or x86 NetBSD.
				463
				464	</ul>
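<p>
The symbol-prefixing convention described in the list above can be
sketched with token pasting. The prefix <code>vgPlain_</code> is the one
named in the text; the exact shape of the macro in
<code>vg_constants.h</code> is assumed here, and the names
<code>SKETCH_PASTE</code>, <code>SKETCH_STR</code>,
<code>sketch_counter</code> and <code>sketch_bump</code> are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Assumed shape of the macro; the prefix vgPlain_ is the one named above. */
#define SKETCH_PASTE(a, b) a##b
#define VG_(name) SKETCH_PASTE(vgPlain_, name)

/* Helpers to view the expanded name as a string (illustration only). */
#define SKETCH_STR2(x) #x
#define SKETCH_STR(x)  SKETCH_STR2(x)

/* Not exported: static symbols never reach the dynamic symbol table. */
static int local_step(void) { return 1; }

/* Exported, and therefore prefixed. Both names are hypothetical. */
int VG_(sketch_counter) = 0;

void VG_(sketch_bump)(void)
{
    VG_(sketch_counter) += local_step();
}
```

Because the prefix lives in one macro, changing it to avoid a newly
discovered clash is a one-line edit followed by a rebuild.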
				465
				466	<h3>Current limitations</h3>
				467
				468	No threads. I think fixing this is close to a research-grade problem.
				469	<p>
				470	No MMX. Fixing this should be relatively easy, using the same giant
				471	trick used for x86 FPU instructions. See below.
				472	<p>
				473	Support for weird (non-POSIX) signal stuff is patchy. Does anybody
				474	care?
				475	<p>
				476
				477
				478
				479
				480	<hr width="100%">
				481
				482	<h2>The instrumenting JITter</h2>
				483
				484	This really is the heart of the matter. We begin with various side
				485	issues.
				486
				487	<h3>Run-time storage, and the use of host registers</h3>
				488
				489	Valgrind translates client (original) basic blocks into instrumented
				490	basic blocks, which live in the translation cache TC, until either the
				491	client finishes or the translations are ejected from TC to make room
				492	for newer ones.
				493	<p>
				494	Since it generates x86 code in memory, Valgrind has complete control
				495	of the use of registers in the translations. Now pay attention. I
				496	shall say this only once, and it is important you understand this. In
				497	what follows I will refer to registers in the host (real) cpu using
				498	their standard names, <code>%eax</code>, <code>%edi</code>, etc. I
				499	refer to registers in the simulated CPU by capitalising them:
				500	<code>%EAX</code>, <code>%EDI</code>, etc. These two sets of
				501	registers usually bear no direct relationship to each other; there is
				502	no fixed mapping between them. This naming scheme is used fairly
				503	consistently in the comments in the sources.
				504	<p>
				505	Host registers, once things are up and running, are used as follows:
				506	<ul>
				507	<li><code>%esp</code>, the real stack pointer, points
				508	somewhere in Valgrind's private stack area,
				509	<code>VG_(stack)</code> or, transiently, into its signal delivery
				510	stack, <code>VG_(sigstack)</code>.
				511	<p>
				512	<li><code>%edi</code> is used as a temporary in code generation; it
				513	is almost always dead, except when used for the <code>Left</code>
				514	value-tag operations.
				515	<p>
				516	<li><code>%eax</code>, <code>%ebx</code>, <code>%ecx</code>,
				517	<code>%edx</code> and <code>%esi</code> are available to
				518	Valgrind's register allocator. They are dead (carry unimportant
				519	values) in between translations, and are live only in
				520	translations. The one exception to this is <code>%eax</code>,
				521	which, as mentioned far above, has a special significance to the
				522	dispatch loop <code>VG_(dispatch)</code>: when a translation
				523	returns to the dispatch loop, <code>%eax</code> is expected to
				524	contain the original-code-address of the next translation to run.
				525	The register allocator is so good at minimising spill code that
				526	using five regs and not having to save/restore <code>%edi</code>
				527	actually gives better code than allocating to <code>%edi</code>
				528	as well, but then having to push/pop it around special uses.
				529	<p>
				530	<li><code>%ebp</code> points permanently at
				531	<code>VG_(baseBlock)</code>. Valgrind's translations are
				532	position-independent, partly because this is convenient, but also
				533	because translations get moved around in TC as part of the LRUing
				534	activity. <b>All</b> static entities which need to be referred to
				535	from generated code, whether data or helper functions, are stored
				536	starting at <code>VG_(baseBlock)</code> and are therefore reached
				537	by indexing from <code>%ebp</code>. There is but one exception,
				538	which is that by placing the value
				539	<code>VG_EBP_DISPATCH_CHECKED</code>
				540	in <code>%ebp</code> just before a return to the dispatcher,
				541	the dispatcher is informed that the next address to run,
				542	in <code>%eax</code>, requires special treatment.
				543	<p>
				544	<li>The real machine's FPU state is pretty much unimportant, for
				545	reasons which will become obvious. Ditto its <code>%eflags</code>
				546	register.
				547	</ul>
				548
				549	<p>
				550	The state of the simulated CPU is stored in memory, in
				551	<code>VG_(baseBlock)</code>, which is a block of 200 words IIRC.
				552	Recall that <code>%ebp</code> points permanently at the start of this
				553	block. Function <code>vg_init_baseBlock</code> decides what the
				554	offsets of various entities in <code>VG_(baseBlock)</code> are to be,
				555	and allocates word offsets for them. The code generator then emits
				556	<code>%ebp</code> relative addresses to get at those things. The
				557	sequence in which entities are allocated has been carefully chosen so
				558	that the 32 most popular entities come first, because this means 8-bit
				559	offsets can be used in the generated code.
				560
				561	<p>
				562	If I was clever, I could make <code>%ebp</code> point 32 words along
				563	<code>VG_(baseBlock)</code>, so that I'd have another 32 words of
				564	short-form offsets available, but that's just complicated, and it's
				565	not important -- the first 32 words take 99% (or whatever) of the
				566	traffic.
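<p>
The figure of 32 words follows from the x86 encoding: the short
displacement form is a signed 8-bit byte offset, so it covers -128 to
+127 bytes from the base register. The sketch below works through that
arithmetic for 4-byte words; the function names and the block size of
200 words used in the example are illustrative only.

```c
#include <assert.h>

#define WORD_BYTES 4

/* Nonzero if a byte displacement fits x86's signed 8-bit form. */
static int fits_disp8(long byte_off)
{
    return byte_off >= -128 && byte_off <= 127;
}

/* Counts the words in [0, n_words) that are reachable with an 8-bit
   displacement when the base register points bias_words into the block. */
static int short_form_words(int n_words, int bias_words)
{
    int count = 0;
    for (int w = 0; w < n_words; w++)
        if (fits_disp8((long)(w - bias_words) * WORD_BYTES))
            count++;
    return count;
}
```

With the base register at the start of the block, words 0 to 31 (byte
offsets 0 to 124) qualify. Biasing the register by 32 words would make
the negative half of the range usable too, doubling the count to 64.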
				567
				568	<p>
				569	Currently, the sequence of stuff in <code>VG_(baseBlock)</code> is as
				570	follows:
				571	<ul>
				572	<li>9 words, holding the simulated integer registers,
				573	<code>%EAX</code> .. <code>%EDI</code>, and the simulated flags,
				574	<code>%EFLAGS</code>.
				575	<p>
				576	<li>Another 9 words, holding the V bit "shadows" for the above 9 regs.
				577	<p>
				578	<li>The <b>addresses</b> of various helper routines called from
				579	generated code:
				580	<code>VG_(helper_value_check4_fail)</code>,
				581	<code>VG_(helper_value_check0_fail)</code>,
				582	which register V-check failures,
				583	<code>VG_(helperc_STOREV4)</code>,
				584	<code>VG_(helperc_STOREV1)</code>,
				585	<code>VG_(helperc_LOADV4)</code>,
				586	<code>VG_(helperc_LOADV1)</code>,
				587	which do stores and loads of V bits to/from the
				588	sparse array which keeps track of V bits in memory,
				589	and
				590	<code>VGM_(handle_esp_assignment)</code>, which messes with
				591	memory addressability resulting from changes in <code>%ESP</code>.
				592	<p>
				593	<li>The simulated <code>%EIP</code>.
				594	<p>
				595	<li>24 spill words, for when the register allocator can't make it work
				596	with 5 measly registers.
				597	<p>
				598	<li>Addresses of helpers <code>VG_(helperc_STOREV2)</code>,
				599	<code>VG_(helperc_LOADV2)</code>. These are here because 2-byte
				600	loads and stores are relatively rare, so are placed above the
				601	magic 32-word offset boundary.
				602	<p>
				603	<li>For similar reasons, addresses of helper functions
				604	<code>VGM_(fpu_write_check)</code> and
				605	<code>VGM_(fpu_read_check)</code>, which handle the A/V maps
				606	testing and changes required by FPU writes/reads.
				607	<p>
				608	<li>Some other boring helper addresses:
				609	<code>VG_(helper_value_check2_fail)</code> and
				610	<code>VG_(helper_value_check1_fail)</code>. These are probably
				611	never emitted now, and should be removed.
				612	<p>
				613	<li>The entire state of the simulated FPU, which I believe to be
				614	108 bytes long.
				615	<p>
				616	<li>Finally, the addresses of various other helper functions in
				617	<code>vg_helpers.S</code>, which deal with rare situations which
				618	are tedious or difficult to generate code in-line for.
				619	</ul>
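<p>
The allocation order in the list above can be pictured as a simple bump allocator handing out word offsets. This is only a sketch: <code>alloc_words</code>, <code>struct layout</code> and <code>init_layout</code> are hypothetical names, the word counts are taken literally from the list, and the real <code>vg_init_baseBlock</code> may well differ in detail.

```c
#include <assert.h>

/* Next free word in the (hypothetical) block; bumped by alloc_words(). */
static int next_free_word = 0;

/* Reserve n consecutive words and return the word offset of the first. */
static int alloc_words(int n)
{
    assert(n > 0);
    int off = next_free_word;
    next_free_word += n;
    return off;
}

/* Word offsets of the first few groups, in the order listed above. */
struct layout {
    int int_regs;     /* %EAX .. %EDI plus %EFLAGS: 9 words       */
    int shadow_regs;  /* V-bit shadows of the above: 9 words      */
    int hot_helpers;  /* 7 frequently used helper addresses       */
    int eip;          /* simulated %EIP: 1 word                   */
    int spills;       /* 24 spill words                           */
};

static struct layout init_layout(void)
{
    struct layout l;
    next_free_word = 0;
    l.int_regs    = alloc_words(9);
    l.shadow_regs = alloc_words(9);
    l.hot_helpers = alloc_words(7);
    l.eip         = alloc_words(1);
    l.spills      = alloc_words(24);
    return l;
}
```

Taking the counts at face value, the spill area would start at word 26, so only its first six slots would fall below the 32-word boundary for short offsets.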
				620
				621	<p>
				622	As a general rule, the simulated machine's state lives permanently in
				623	memory at <code>VG_(baseBlock)</code>. However, the JITter does some
				624	optimisations which allow the simulated integer registers to be
				625	cached in real registers over multiple simulated instructions within
				626	the same basic block. These are always flushed back into memory at
				627	the end of every basic block, so that the in-memory state is
				628	up-to-date between basic blocks. (This flushing is implied by the
				629	statement above that the real machine's allocatable registers are
				630	dead in between simulated blocks).
				631
				632
				633	<h3>Startup, shutdown, and system calls</h3>
				634
				635	Getting into Valgrind (<code>VG_(startup)</code>, called from
				636	<code>valgrind.so</code>'s initialisation section), really means
				637	copying the real CPU's state into <code>VG_(baseBlock)</code>, and
				638	then installing our own stack pointer, etc, into the real CPU, and
				639	then starting up the JITter. Exiting Valgrind involves copying the
				640	simulated state back to the real state.
				641
				642	<p>
				643	Unfortunately, there's a complication at startup time. Problem is
				644	that at the point where we need to take a snapshot of the real CPU's
				645	state, the offsets in <code>VG_(baseBlock)</code> are not set up yet,
				646	because to do so would involve disrupting the real machine's state
				647	significantly. The way round this is to dump the real machine's state
				648	into a temporary, static block of memory,
				649	<code>VG_(m_state_static)</code>. We can then set up the
				650	<code>VG_(baseBlock)</code> offsets at our leisure, and copy into it
				651	from <code>VG_(m_state_static)</code> at some convenient later time.
				652	This copying is done by
				653	<code>VG_(copy_m_state_static_to_baseBlock)</code>.
				654
				655	<p>
				656	On exit, the inverse transformation is (rather unnecessarily) used:
				657	stuff in <code>VG_(baseBlock)</code> is copied to
				658	<code>VG_(m_state_static)</code>, and the assembly stub then copies
				659	from <code>VG_(m_state_static)</code> into the real machine registers.
				660
				661	<p>
				662	Doing system calls on behalf of the client (<code>vg_syscall.S</code>)
				663	is something of a half-way house. We have to make the world look
				664	sufficiently like that which the client would normally have to make
				665	the syscall actually work properly, but we can't afford to lose
				666	control. So the trick is to copy all of the client's state, <b>except
				667	its program counter</b>, into the real CPU, do the system call, and
				668	copy the state back out. Note that the client's state includes its
				669	stack pointer register, so one effect of this partial restoration is
				670	to cause the system call to be run on the client's stack, as it should
				671	be.
				672
				673	<p>
				674	As ever there are complications. We have to save some of our own state
				675	somewhere when restoring the client's state into the CPU, so that we
				676	can keep going sensibly afterwards. In fact the only thing which is
				677	important is our own stack pointer, but for paranoia reasons I save
				678	and restore our own FPU state as well, even though that's probably
				679	pointless.
				680
				681	<p>
				682	The complication on the above complication is, that for horrible
				683	reasons to do with signals, we may have to handle a second client
				684	system call whilst the client is blocked inside some other system
				685	call (unbelievable!). That means there's two sets of places to
				686	dump Valgrind's stack pointer and FPU state across the syscall,
				687	and we decide which to use by consulting
				688	<code>VG_(syscall_depth)</code>, which is in turn maintained by
				689	<code>VG_(wrap_syscall)</code>.
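<p>
A rough sketch of the "two sets of places" idea, assuming the save areas are simply indexed by the current syscall depth. Everything here is hypothetical (the struct, the function names, the plain <code>int</code> standing in for <code>VG_(syscall_depth)</code>); the real code lives in assembly in <code>vg_syscall.S</code> and is not reproduced.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { MAX_SYSCALL_DEPTH = 2, FPU_STATE_BYTES = 108 };

/* What Valgrind parks while the client's state occupies the real CPU. */
typedef struct {
    uint32_t esp;                   /* Valgrind's own stack pointer */
    uint8_t  fpu[FPU_STATE_BYTES];  /* Valgrind's own FPU state     */
} SavedHostState;

static SavedHostState save_area[MAX_SYSCALL_DEPTH];
static int syscall_depth = 0;       /* stand-in for VG_(syscall_depth) */

/* Pick the save area for a syscall that is about to start, and fill it. */
static SavedHostState *syscall_enter(uint32_t host_esp, const uint8_t *host_fpu)
{
    assert(syscall_depth < MAX_SYSCALL_DEPTH);
    SavedHostState *s = &save_area[syscall_depth++];
    s->esp = host_esp;
    memcpy(s->fpu, host_fpu, FPU_STATE_BYTES);
    return s;
}

/* Fetch the save area of the innermost syscall as it finishes. */
static const SavedHostState *syscall_leave(void)
{
    assert(syscall_depth > 0);
    return &save_area[--syscall_depth];
}
```

A nested syscall then gets the second slot, and unwinding restores the inner state first and the outer state second.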
				690
				691
				692
				693	<h3>Introduction to UCode</h3>
				694
				695	UCode lies at the heart of the x86-to-x86 JITter. The basic premise
				696	is that dealing with the x86 instruction set head-on is just too darn
				697	complicated, so we do the traditional compiler-writer's trick and
				698	translate it into a simpler, easier-to-deal-with form.
				699
				700	<p>
				701	In normal operation, translation proceeds through six stages,
				702	coordinated by <code>VG_(translate)</code>:
				703	<ol>
				704	<li>Parsing of an x86 basic block into a sequence of UCode
				705	instructions (<code>VG_(disBB)</code>).
				706	<p>
				707	<li>UCode optimisation (<code>vg_improve</code>), with the aim of
				708	caching simulated registers in real registers over multiple
				709	simulated instructions, and removing redundant simulated
				710	<code>%EFLAGS</code> saving/restoring.
				711	<p>
				712	<li>UCode instrumentation (<code>vg_instrument</code>), which adds
				713	value and address checking code.
				714	<p>
				715	<li>Post-instrumentation cleanup (<code>vg_cleanup</code>), removing
				716	redundant value-check computations.
				717	<p>
				718	<li>Register allocation (<code>vg_do_register_allocation</code>),
				719	which, note, is done on UCode.
				720	<p>
				721	<li>Emission of final instrumented x86 code
				722	(<code>VG_(emit_code)</code>).
				723	</ol>
				724
				725	<p>
				726	Notice how steps 2, 3, 4 and 5 are simple UCode-to-UCode
				727	transformation passes, all on straight-line blocks of UCode (type
				728	<code>UCodeBlock</code>). Steps 2 and 4 are optimisation passes and
				729	can be disabled for debugging purposes, with
				730	<code>--optimise=no</code> and <code>--cleanup=no</code> respectively.
				731
				732	<p>
				733	Valgrind can also run in a no-instrumentation mode, given
				734	<code>--instrument=no</code>. This is useful for debugging the JITter
				735	quickly without having to deal with the complexity of the
				736	instrumentation mechanism too. In this mode, steps 3 and 4 are
				737	omitted.
				738
				739	<p>
				740	These flags combine, so that <code>--instrument=no</code> together with
				741	<code>--optimise=no</code> means only steps 1, 5 and 6 are used.
				742	<code>--single-step=yes</code> causes each x86 instruction to be
				743	treated as a single basic block. The translations are terrible but
				744	this is sometimes instructive.
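<p>
The way the flags select stages can be written down as a small C sketch. The enum names and <code>stages_to_run</code> are invented here; only the stage numbering and the flag behaviour come from the text above, and the real <code>VG_(translate)</code> is not structured this way as far as these notes say.

```c
#include <assert.h>
#include <stdbool.h>

/* One bit per translation stage, numbered as in the list above. */
enum {
    STAGE_DISASSEMBLE = 1 << 0,  /* 1: x86 -> UCode                 */
    STAGE_IMPROVE     = 1 << 1,  /* 2: UCode optimisation           */
    STAGE_INSTRUMENT  = 1 << 2,  /* 3: add checking code            */
    STAGE_CLEANUP     = 1 << 3,  /* 4: post-instrumentation cleanup */
    STAGE_REGALLOC    = 1 << 4,  /* 5: register allocation          */
    STAGE_EMIT        = 1 << 5   /* 6: emit x86 code                */
};

/* Stages that run for a given setting of --optimise, --instrument and
   --cleanup. Stages 1, 5 and 6 always run. */
static unsigned stages_to_run(bool optimise, bool instrument, bool cleanup)
{
    unsigned s = STAGE_DISASSEMBLE | STAGE_REGALLOC | STAGE_EMIT;
    if (optimise)
        s |= STAGE_IMPROVE;
    if (instrument) {
        s |= STAGE_INSTRUMENT;
        if (cleanup)            /* cleanup only makes sense after stage 3 */
            s |= STAGE_CLEANUP;
    }
    return s;
}
```

For example, <code>stages_to_run(false, false, true)</code> yields only stages 1, 5 and 6, matching the <code>--instrument=no --optimise=no</code> case described above.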
				745
				746	<p>
				747	The <code>--stop-after=N</code> flag switches back to the real CPU
				748	after <code>N</code> basic blocks. It also re-JITs the final basic
				749	block executed and prints the resulting debugging info, so this
				750	gives you a way to get a quick snapshot of how a basic block looks as
				751	it passes through the six stages mentioned above. If you want to
				752	see full information for every block translated (probably not, but
				753	still ...) find, in <code>VG_(translate)</code>, the lines
				754	<br><code> dis = True;</code>
				755	<br><code> dis = debugging_translation;</code>
				756	<br>
				757	and comment out the second line. This will spew out debugging
				758	junk faster than you can possibly imagine.
				759
				760
				761
				762	<h3>UCode operand tags: type <code>Tag</code></h3>
				763
				764	UCode is, more or less, a simple two-address RISC-like code. In
				765	keeping with the x86 AT&T assembly syntax, generally speaking the
				766	first operand is the source operand, and the second is the destination
				767	operand, which is modified when the uinstr is notionally executed.
				768
				769	<p>
				770	UCode instructions have up to three operand fields, each of which has
				771	a corresponding <code>Tag</code> describing it. Possible values for
				772	the tag are:
				773
				774	<ul>
				775	<li><code>NoValue</code>: indicates that the field is not in use.
				776	<p>
				777	<li><code>Lit16</code>: the field contains a 16-bit literal.
				778	<p>
				779	<li><code>Literal</code>: the field denotes a 32-bit literal, whose
				780	value is stored in the <code>lit32</code> field of the uinstr
				781	itself. Since there is only one <code>lit32</code> for the whole
				782	uinstr, only one operand field may contain this tag.
				783	<p>
				784	<li><code>SpillNo</code>: the field contains a spill slot number, in
				785	the range 0 to 23 inclusive, denoting one of the spill slots
				786	contained inside <code>VG_(baseBlock)</code>. Such tags only
				787	exist after register allocation.
				788	<p>
				789	<li><code>RealReg</code>: the field contains a number in the range 0
				790	to 7 denoting an integer x86 ("real") register on the host. The
				791	number is the Intel encoding for integer registers. Such tags
				792	only exist after register allocation.
				793	<p>
				794	<li><code>ArchReg</code>: the field contains a number in the range 0
				795	to 7 denoting an integer x86 register on the simulated CPU. In
				796	reality this means a reference to one of the first 8 words of
				797	<code>VG_(baseBlock)</code>. Such tags can exist at any point in
				798	the translation process.
				799	<p>
				800	<li>Last, but not least, <code>TempReg</code>. The field contains the
				801	number of one of an infinite set of virtual (integer)
				802	registers. <code>TempReg</code>s are used everywhere throughout
				803	the translation process; you can have as many as you want. The
				804	register allocator maps as many as it can into
				805	<code>RealReg</code>s and turns the rest into
				806	<code>SpillNo</code>s, so <code>TempReg</code>s should not exist
				807	after the register allocation phase.
				808	<p>
				809	<code>TempReg</code>s are always 32 bits long, even if the data
				810	they hold is logically shorter. In that case the upper unused
				811	bits are required, and, I think, generally assumed, to be zero.
				812	<code>TempReg</code>s holding V bits for quantities shorter than
				813	32 bits are expected to have ones in the unused places, since a
				814	one denotes "undefined".
				815	</ul>
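<p>
The two padding conventions for <code>TempReg</code>s (zeroes above a short value, ones above short V bits) can be sketched as follows. The helper names are made up, and the sketch covers only the 1, 2 and 4 byte cases.

```c
#include <assert.h>
#include <stdint.h>

/* Mask covering the meaningful low bits of a TempReg holding a 1-, 2-
   or 4-byte quantity. */
static uint32_t size_mask(int size_bytes)
{
    assert(size_bytes == 1 || size_bytes == 2 || size_bytes == 4);
    return size_bytes == 4 ? 0xFFFFFFFFu : (1u << (8 * size_bytes)) - 1u;
}

/* Data values: the unused upper bits are zero. */
static uint32_t normalise_value(uint32_t v, int size_bytes)
{
    return v & size_mask(size_bytes);
}

/* V bits: the unused upper bits are one, that is, "undefined". */
static uint32_t normalise_vbits(uint32_t vbits, int size_bytes)
{
    return vbits | ~size_mask(size_bytes);
}
```

So a fully defined byte is carried as value <code>0x000000NN</code> with V bits <code>0xFFFFFF00</code>.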
				816
				817
				818	<h3>UCode instructions: type <code>UInstr</code></h3>
				819
				820	<p>
				821	UCode was carefully designed to make it possible to do register
				822	allocation on UCode and then translate the result into x86 code
				823	without needing any extra registers ... well, that was the original
				824	plan, anyway. Things have gotten a little more complicated since
				825	then. In what follows, UCode instructions are referred to as uinstrs,
				826	to distinguish them from x86 instructions. Uinstrs of course have
				827	uopcodes which are (naturally) different from x86 opcodes.
				828
				829	<p>
				830	A uinstr (type <code>UInstr</code>) contains
				831	various fields, not all of which are used by any one uopcode:
				832	<ul>
				833	<li>Three 16-bit operand fields, <code>val1</code>, <code>val2</code>
				834	and <code>val3</code>.
				835	<p>
				836	<li>Three tag fields, <code>tag1</code>, <code>tag2</code>
				837	and <code>tag3</code>. Each of these has a value of type
				838	<code>Tag</code>,
				839	and they describe what the <code>val1</code>, <code>val2</code>
				840	and <code>val3</code> fields contain.
				841	<p>
				842	<li>A 32-bit literal field.
				843	<p>
				844	<li>Two <code>FlagSet</code>s, specifying which x86 condition codes are
				845	read and written by the uinstr.
				846	<p>
				847	<li>An opcode byte, containing a value of type <code>Opcode</code>.
				848	<p>
				849	<li>A size field, indicating the data transfer size (1/2/4/8/10) in
				850	cases where this makes sense, or zero otherwise.
				851	<p>
				852	<li>A condition-code field, which, for jumps, holds a
				853	value of type <code>Condcode</code>, indicating the condition
				854	which applies. The encoding is as it is in the x86 insn stream,
				855	except we add a 17th value <code>CondAlways</code> to indicate
				856	an unconditional transfer.
				857	<p>
				858	<li>Various 1-bit flags, indicating whether this insn pertains to an
				859	x86 CALL or RET instruction, whether a widening is signed or not,
				860	etc.
				861	</ul>
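<p>
To make the field list concrete, here is an illustrative C rendering together with two checks that follow from the tag descriptions earlier. It is a sketch under assumptions: the struct name, the field names other than <code>val1</code>..<code>val3</code>, <code>tag1</code>..<code>tag3</code> and <code>lit32</code>, and the field widths are invented, and the flag sets, condition code and 1-bit flags are left out. <code>VG_(saneUInstr)</code> remains the authority on what is actually allowed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { NoValue, Lit16, Literal, SpillNo, RealReg, ArchReg, TempReg } Tag;

/* Illustrative only: the real struct packs its fields differently. */
typedef struct {
    uint32_t lit32;             /* the single 32-bit literal        */
    uint16_t val1, val2, val3;  /* operand fields                   */
    Tag      tag1, tag2, tag3;  /* what each operand field contains */
    uint8_t  opcode;            /* uopcode                          */
    uint8_t  size;              /* 0, 1, 2, 4, 8 or 10              */
} UInstrSketch;

/* There is only one lit32 per uinstr, so at most one operand may be
   tagged Literal. */
static bool literal_tags_ok(const UInstrSketch *u)
{
    int n = (u->tag1 == Literal) + (u->tag2 == Literal) + (u->tag3 == Literal);
    return n <= 1;
}

/* SpillNo and RealReg exist only after register allocation; TempReg
   should not survive it. The other tags are fine at any point. */
static bool tag_ok_for_phase(Tag t, bool after_regalloc)
{
    if (t == SpillNo || t == RealReg)
        return after_regalloc;
    if (t == TempReg)
        return !after_regalloc;
    return true;
}
```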
				862
				863	<p>
				864	UOpcodes (type <code>Opcode</code>) are divided into two groups: those
				865	necessary merely to express the functionality of the x86 code, and
				866	extra uopcodes needed to express the instrumentation. The former
				867	group contains:
				868	<ul>
				869	<li><code>GET</code> and <code>PUT</code>, which move values from the
				870	simulated CPU's integer registers (<code>ArchReg</code>s) into
				871	<code>TempReg</code>s, and back. <code>GETF</code> and
				872	<code>PUTF</code> do the corresponding thing for the simulated
				873	<code>%EFLAGS</code>. There are no corresponding insns for the
				874	FPU register stack, since we don't explicitly simulate its
				875	registers.
				876	<p>
				877	<li><code>LOAD</code> and <code>STORE</code>, which, in RISC-like
				878	fashion, are the only uinstrs able to interact with memory.
				879	<p>
				880	<li><code>MOV</code> and <code>CMOV</code> allow unconditional and
				881	conditional moves of values between <code>TempReg</code>s.
				882	<p>
				883	<li>ALU operations. Again in RISC-like fashion, these only operate on
				884	<code>TempReg</code>s (before reg-alloc) or <code>RealReg</code>s
				885	(after reg-alloc). These are: <code>ADD</code>, <code>ADC</code>,
				886	<code>AND</code>, <code>OR</code>, <code>XOR</code>,
				887	<code>SUB</code>, <code>SBB</code>, <code>SHL</code>,
				888	<code>SHR</code>, <code>SAR</code>, <code>ROL</code>,
				889	<code>ROR</code>, <code>RCL</code>, <code>RCR</code>,
				890	<code>NOT</code>, <code>NEG</code>, <code>INC</code>,
				891	<code>DEC</code>, <code>BSWAP</code>, <code>CC2VAL</code> and
				892	<code>WIDEN</code>. <code>WIDEN</code> does signed or unsigned
				893	value widening. <code>CC2VAL</code> is used to convert condition
				894	codes into a value, zero or one. The rest are obvious.
				895	<p>
				896	To allow for more efficient code generation, we bend slightly the
				897	restriction at the start of the previous para: for
				898	<code>ADD</code>, <code>ADC</code>, <code>XOR</code>,
				899	<code>SUB</code> and <code>SBB</code>, we allow the first (source)
				900	operand to also be an <code>ArchReg</code>, that is, one of the
				901	simulated machine's registers. Also, many of these ALU ops allow
				902	the source operand to be a literal. See
				903	<code>VG_(saneUInstr)</code> for the final word on the allowable
				904	forms of uinstrs.
				905	<p>
				906	<li><code>LEA1</code> and <code>LEA2</code> are not strictly
				907	necessary, but facilitate better translations. They
				908	record the fancy x86 addressing modes in a direct way, which
				909	allows those amodes to be emitted back into the final
				910	instruction stream more or less verbatim.
				911	between <code>TempReg</code>s and the real (Valgrind's) stack, and
				912	<li><code>CALLM</code> calls a machine-code helper, one of the methods
				913	whose address is stored at some <code>VG_(baseBlock)</code>
				914	offset. <code>PUSH</code> and <code>POP</code> move values
				915	to/from <code>TempReg</code> to the real (Valgrind's) stack, and
				916	<code>CLEAR</code> removes values from the stack.
				917	<code>CALLM_S</code> and <code>CALLM_E</code> delimit the
				918	boundaries of call setups and clearings, for the benefit of the
				919	instrumentation passes. Getting this right is critical, and so
				920	<code>VG_(saneUCodeBlock)</code> makes various checks on the use
				921	of these uopcodes.
				922	<p>
				923	It is important to understand that these uopcodes have nothing to
				924	do with the x86 <code>call</code>, <code>ret</code>,
				925	<code>push</code> or <code>pop</code> instructions, and are not
				926	used to implement them. Those guys turn into combinations of
				927	<code>GET</code>, <code>PUT</code>, <code>LOAD</code>,
				928	<code>STORE</code>, <code>ADD</code>, <code>SUB</code>, and
				929	<code>JMP</code>. What these uopcodes support is calling of
				930	helper functions such as <code>VG_(helper_imul_32_64)</code>,
				931	which do stuff which is too difficult or tedious to emit inline.
				932	<p>
				933	<li><code>FPU</code>, <code>FPU_R</code> and <code>FPU_W</code>.
				934	Valgrind doesn't attempt to simulate the internal state of the
				935	FPU at all. Consequently it only needs to be able to distinguish
				936	FPU ops which read and write memory from those that don't, and
				937	for those which do, it needs to know the effective address and
				938	data transfer size. This is made easier because the x86 FP
				939	instruction encoding is very regular, basically consisting of
				940	16 bits for a non-memory FPU insn and 11 (IIRC) bits + an address mode
				941	for a memory FPU insn. So our <code>FPU</code> uinstr carries
				942	the 16 bits in its <code>val1</code> field. And
				943	<code>FPU_R</code> and <code>FPU_W</code> carry 11 bits in that
				944	field, together with the identity of a <code>TempReg</code> or
				945	(later) <code>RealReg</code> which contains the address.
				946	<p>
				947	<li><code>JIFZ</code> is unique, in that it allows a control-flow
				948	transfer which is not deemed to end a basic block. It causes a
				949	jump to a literal (original) address if the specified argument
				950	is zero.
				951	<p>
				952	<li>Finally, <code>INCEIP</code> advances the simulated
				953	<code>%EIP</code> by the specified literal amount. This supports
				954	lazy <code>%EIP</code> updating, as described below.
				955	</ul>
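<p>
As an illustration of the condition-code field and of what <code>CC2VAL</code> computes, the sketch below evaluates a condition against an <code>%EFLAGS</code> value and returns zero or one. It uses the standard x86 condition encodings (0 = O, 1 = NO, 2 = B, ... 15 = NLE) and flag bit positions. Giving the extra always-true value the number 16 is an assumption based on the "17th value" remark; the names are invented.

```c
#include <assert.h>
#include <stdint.h>

/* EFLAGS bit positions on x86. */
enum { FLAG_C = 1 << 0, FLAG_P = 1 << 2, FLAG_Z = 1 << 6,
       FLAG_S = 1 << 7, FLAG_O = 1 << 11 };

/* Assumed numbering for the extra "always" condition. */
enum { COND_ALWAYS = 16 };

/* Return 1 if condition `cond` holds for `eflags`, else 0. */
static int cond_to_value(unsigned cond, uint32_t eflags)
{
    assert(cond <= COND_ALWAYS);
    if (cond == COND_ALWAYS)
        return 1;

    int cf = (eflags & FLAG_C) != 0;
    int pf = (eflags & FLAG_P) != 0;
    int zf = (eflags & FLAG_Z) != 0;
    int sf = (eflags & FLAG_S) != 0;
    int of = (eflags & FLAG_O) != 0;

    int r;
    switch (cond >> 1) {
    case 0:  r = of;              break;  /* O  / NO  */
    case 1:  r = cf;              break;  /* B  / NB  */
    case 2:  r = zf;              break;  /* Z  / NZ  */
    case 3:  r = cf | zf;         break;  /* BE / NBE */
    case 4:  r = sf;              break;  /* S  / NS  */
    case 5:  r = pf;              break;  /* P  / NP  */
    case 6:  r = (sf != of);      break;  /* L  / NL  */
    default: r = zf | (sf != of); break;  /* LE / NLE */
    }
    return r ^ (int)(cond & 1u);          /* odd encodings negate */
}
```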
				956
				957	<p>
				958	Stages 1 and 2 of the 6-stage translation process mentioned above
				959	deal purely with these uopcodes, and no others. They are
				960	sufficient to express pretty much all the x86 32-bit protected-mode
				961	instruction set, at
				962	least everything understood by a pre-MMX original Pentium (P54C).
				963
				964	<p>
				965	Stages 3, 4, 5 and 6 also deal with the following extra
				966	"instrumentation" uopcodes. They are used to express all the
				967	definedness-tracking and -checking machinery which Valgrind does. In
				968	later sections we show how to create checking code for each of the
				969	uopcodes above. Note that these instrumentation uopcodes, although
				970	some appearing complicated, have been carefully chosen so that
				971	efficient x86 code can be generated for them. GNU superopt v2.5 did a
				972	great job helping out here. Anyways, the uopcodes are as follows:
				973
				974	<ul>
				975	<li><code>GETV</code> and <code>PUTV</code> are analogues to
				976	<code>GET</code> and <code>PUT</code> above. They are identical
				977	except that they move the V bits for the specified values back and
				978	forth to <code>TempRegs</code>, rather than moving the values
				979	themselves.
				980	<p>
				981	<li>Similarly, <code>LOADV</code> and <code>STOREV</code> read and
				982	write V bits from the synthesised shadow memory that Valgrind
				983	maintains. In fact they do more than that, since they also do
				984	address-validity checks, and emit complaints if the read/written
				985	addresses are unaddressable.
				986	<p>
				987	<li><code>TESTV</code>, whose parameters are a <code>TempReg</code>
				988	and a size, tests the V bits in the <code>TempReg</code>, at the
				989	specified operation size (0/1/2/4 byte) and emits an error if any
				990	of them indicate undefinedness. This is the only uopcode capable
				991	of doing such tests.
				992	<p>
				993	<li><code>SETV</code>, whose parameters are also a <code>TempReg</code>
				994	and a size, makes the V bits in the <code>TempReg</code> indicate
				995	definedness, at the specified operation size. This is usually
				996	used to generate the correct V bits for a literal value, which is
				997	of course fully defined.
				998	<p>
				999	<li><code>GETVF</code> and <code>PUTVF</code> are analogues to
				1000	<code>GETF</code> and <code>PUTF</code>. They move the single V
				1001	bit used to model definedness of <code>%EFLAGS</code> between its
				1002	home in <code>VG_(baseBlock)</code> and the specified
				1003	<code>TempReg</code>.
				1004	<p>
				1005	<li><code>TAG1</code> denotes one of a family of unary operations on
				1006	<code>TempReg</code>s containing V bits. Similarly,
				1007	<code>TAG2</code> denotes one in a family of binary operations on
				1008	V bits.
				1009	</ul>
				1010
				1011	<p>
				1012	These 10 uopcodes are sufficient to express Valgrind's entire
				1013	definedness-checking semantics. In fact most of the interesting magic
				1014	is done by the <code>TAG1</code> and <code>TAG2</code>
				1015	suboperations.
				1016
				1017	<p>
				1018	First, however, I need to explain about V-vector operation sizes.
				1019	There are 4 sizes. Sizes 1, 2 and 4 operate on groups of 8, 16 and 32
				1020	V bits at a time, supporting the usual 1, 2 and 4 byte x86 operations.
				1021	However there is also the mysterious size 0, which really means a
				1022	single V bit. Single V bits are used in various circumstances; in
				1023	particular, the definedness of <code>%EFLAGS</code> is modelled with a
				1024	single V bit. Now might be a good time to also point out that for
				1025	V bits, 1 means "undefined" and 0 means "defined". Similarly, for A
				1026	bits, 1 means "invalid address" and 0 means "valid address". This
				1027	seems counterintuitive (and so it is), but testing against zero on
				1028	x86s saves instructions compared to testing against all 1s, because
				1029	many ALU operations set the Z flag for free, so to speak.
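<p>
With the "1 means undefined" encoding, the effects of <code>TESTV</code> and <code>SETV</code> at the four sizes reduce to mask operations. This is a sketch, assuming the meaningful V bits sit in the low end of the register and that the size-0 bit is bit 0; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask of the V bits that matter at a given operation size. */
static uint32_t vbit_mask(int size)
{
    switch (size) {
    case 0:  return 0x1u;          /* a single V bit */
    case 1:  return 0xFFu;
    case 2:  return 0xFFFFu;
    case 4:  return 0xFFFFFFFFu;
    default: assert(!"bad V-vector size"); return 0;
    }
}

/* TESTV-like check: true if any relevant V bit says "undefined". */
static bool testv_fails(uint32_t vbits, int size)
{
    return (vbits & vbit_mask(size)) != 0;
}

/* SETV-like update: mark the relevant V bits as defined (0). */
static uint32_t setv(uint32_t vbits, int size)
{
    return vbits & ~vbit_mask(size);
}
```

Testing for "all defined" is then a test against zero, which is the saving the paragraph above describes.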
				1030
				1031	<p>
				1032	With that in mind, the tag ops are:
				1033
				1034	<ul>
				1035	<li><b>(UNARY) Pessimising casts</b>: <code>VgT_PCast40</code>,
				1036	<code>VgT_PCast20</code>, <code>VgT_PCast10</code>,
				1037	<code>VgT_PCast01</code>, <code>VgT_PCast02</code> and
				1038	<code>VgT_PCast04</code>. A "pessimising cast" takes a V-bit
				1039	vector at one size, and creates a new one at another size,
				1040	pessimised in the sense that if any of the bits in the source
				1041	vector indicate undefinedness, then all the bits in the result
				1042	indicate undefinedness. In this case the casts are all to or from
				1043	a single V bit, so for example <code>VgT_PCast40</code> is a
				1044	pessimising cast from 32 bits to 1, whereas
				1045	<code>VgT_PCast04</code> simply copies the single source V bit
				1046	into all 32 bit positions in the result. Surprisingly, these ops
				1047	can all be implemented very efficiently.
				1048	<p>
				1049	There are also the pessimising casts <code>VgT_PCast14</code>,
				1050	from 8 bits to 32, <code>VgT_PCast12</code>, from 8 bits to 16,
				1051	and <code>VgT_PCast11</code>, from 8 bits to 8. This last one
				1052	seems nonsensical, but in fact it isn't a no-op because, as
				1053	mentioned above, any undefined (1) bits in the source infect the
				1054	entire result.
				1055	<p>
				1056	<li><b>(UNARY) Propagating undefinedness upwards in a word</b>:
				1057	<code>VgT_Left4</code>, <code>VgT_Left2</code> and
				1058	<code>VgT_Left1</code>. These are used to simulate the worst-case
				1059	effects of carry propagation in adds and subtracts. They return a
				1060	V vector identical to the original, except that if the original
				1061	contained any undefined bits, then the lowest such bit and all bits
				1062	above it are marked as undefined too. Hence the Left bit in the names.
				1063	<p>
				1064	<li><b>(UNARY) Signed and unsigned value widening</b>:
				1065	<code>VgT_SWiden14</code>, <code>VgT_SWiden24</code>,
				1066	<code>VgT_SWiden12</code>, <code>VgT_ZWiden14</code>,
				1067	<code>VgT_ZWiden24</code> and <code>VgT_ZWiden12</code>. These
				1068	mimic the definedness effects of standard signed and unsigned
				1069	integer widening. Unsigned widening creates zero bits in the new
				1070	positions, so <code>VgT_ZWiden*</code> accordingly mark
				1071	those parts of their argument as defined. Signed widening copies
				1072	the sign bit into the new positions, so <code>VgT_SWiden*</code>
				1073	copies the definedness of the sign bit into the new positions.
				1074	Because 1 means undefined and 0 means defined, these operations
				1075	can (fascinatingly) be done by the same operations which they
				1076	mimic. Go figure.
				1077	<p>
				1078	<li><b>(BINARY) Undefined-if-either-Undefined,
				1079	Defined-if-either-Defined</b>: <code>VgT_UifU4</code>,
				1080	<code>VgT_UifU2</code>, <code>VgT_UifU1</code>,
				1081	<code>VgT_UifU0</code>, <code>VgT_DifD4</code>,
				1082	<code>VgT_DifD2</code>, <code>VgT_DifD1</code>. These do simple
				1083	bitwise operations on pairs of V-bit vectors, with
				1084	<code>UifU</code> giving undefined if either arg bit is
				1085	undefined, and <code>DifD</code> giving defined if either arg bit
				1086	is defined. Abstract interpretation junkies, if any make it this
				1087	far, may like to think of them as meets and joins (or is it joins
				1088	and meets) in the definedness lattices.
				1089	<p>
				1090	<li><b>(BINARY; one value, one V bits) Generate argument improvement
				1091	terms for AND and OR</b>: <code>VgT_ImproveAND4_TQ</code>,
				1092	<code>VgT_ImproveAND2_TQ</code>, <code>VgT_ImproveAND1_TQ</code>,
				1093	<code>VgT_ImproveOR4_TQ</code>, <code>VgT_ImproveOR2_TQ</code>,
				1094	<code>VgT_ImproveOR1_TQ</code>. These help out with AND and OR
				1095	operations. AND and OR have the inconvenient property that the
				1096	definedness of the result depends on the actual values of the
				1097	arguments as well as their definedness. At the bit level:
				1098	<br><code>1 AND undefined = undefined</code>, but
				1099	<br><code>0 AND undefined = 0</code>, and similarly
				1100	<br><code>0 OR undefined = undefined</code>, but
				1101	<br><code>1 OR undefined = 1</code>.
				1102	<br>
				1103	<p>
				1104	It turns out that gcc (quite legitimately) generates code which
				1105	relies on this fact, so we have to model it properly in order to
				1106	avoid flooding users with spurious value errors. The ultimate
				1107	definedness result of AND and OR is calculated using
				1108	<code>UifU</code> on the definedness of the arguments, but we
				1109	also <code>DifD</code> in some "improvement" terms which
				1110	take into account the above phenomena.
				1111	<p>
				1112	<code>ImproveAND</code> takes as its arguments the actual
				1113	value of an argument to AND (the T) and the definedness of that
				1114	argument (the Q), and returns a V-bit vector which is defined (0)
				1115	for bits which have value 0 and are defined; this, when
				1116	<code>DifD</code>ed into the final result, causes those bits to be
				1117	defined even if the corresponding bit in the other argument is undefined.
				1118	<p>
				1119	The <code>ImproveOR</code> ops do the dual thing for OR
				1120	arguments. Note that XOR does not have this property that one
				1121	argument can make the other irrelevant, so there is no need for
				1122	such complexity for XOR.
				1123	</ul>
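<p>
Several of the tag ops in the list above come down to one or two ordinary bit operations once "1 means undefined" is taken on board. The sketch below shows plausible C equivalents for a few of them. It looks only at the meaningful low bits and ignores the ones-padding convention for short vectors, and the function names are invented; it is not the code Valgrind emits.

```c
#include <assert.h>
#include <stdint.h>

/* UifU: a result bit is undefined (1) if either input bit is. */
static uint32_t uifu(uint32_t qa, uint32_t qb) { return qa | qb; }

/* DifD: a result bit is defined (0) if either input bit is. */
static uint32_t difd(uint32_t qa, uint32_t qb) { return qa & qb; }

/* PCast40: 32 V bits to 1 V bit, undefined if any input bit is. */
static uint32_t pcast40(uint32_t q) { return q != 0; }

/* PCast04: 1 V bit to 32 V bits, every position a copy of the input. */
static uint32_t pcast04(uint32_t q)
{
    assert(q <= 1);
    return q ? 0xFFFFFFFFu : 0u;
}

/* Left4: the lowest undefined bit and everything above it become
   undefined. q | -q sets exactly those bits. */
static uint32_t left4(uint32_t q) { return q | (0u - q); }

/* ZWiden14: the new upper 24 positions are zero, hence defined. */
static uint32_t zwiden14(uint32_t q) { return q & 0xFFu; }

/* SWiden14: the new upper 24 positions copy bit 7's definedness. */
static uint32_t swiden14(uint32_t q)
{
    q &= 0xFFu;
    return (q & 0x80u) ? (q | 0xFFFFFF00u) : q;
}
```

Note how the two widening helpers are just zero- and sign-extension applied to the V bits, as the list remarks.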
				1124
				1125	<p>
				1126	That's all the tag ops. If you stare at this long enough, and then
				1127	run Valgrind and stare at the pre- and post-instrumented ucode, it
				1128	should be fairly obvious how the instrumentation machinery hangs
				1129	together.
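<p>
The AND/OR improvement terms are the least obvious of the tag ops, so here is a worked sketch of how they could combine at 32 bits. It follows the description given earlier (UifU of the argument V bits, with the improvement terms DifD-ed in), but the function names and the exact composition are assumptions rather than a transcription of Valgrind's instrumenter.

```c
#include <assert.h>
#include <stdint.h>

/* ImproveAND: defined (0) exactly where the value bit is 0 and is
   itself defined. t = value (the T), q = its V bits (the Q). */
static uint32_t improve_and(uint32_t t, uint32_t q) { return t | q; }

/* ImproveOR: the dual, defined exactly where the value bit is 1 and
   is itself defined. */
static uint32_t improve_or(uint32_t t, uint32_t q) { return ~t | q; }

/* V bits of (a AND b): start from UifU (|) of the argument V bits,
   then DifD (&) in one improvement term per argument. */
static uint32_t and_vbits(uint32_t a, uint32_t qa, uint32_t b, uint32_t qb)
{
    return (qa | qb) & improve_and(a, qa) & improve_and(b, qb);
}

/* V bits of (a OR b), built the same way from ImproveOR. */
static uint32_t or_vbits(uint32_t a, uint32_t qa, uint32_t b, uint32_t qb)
{
    return (qa | qb) & improve_or(a, qa) & improve_or(b, qb);
}
```

For instance, ANDing a fully defined <code>0x0000FFFF</code> with a completely undefined word leaves only the low 16 result bits undefined, because the defined zeroes in the upper half force those result bits to zero whatever the other operand holds.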
				1130
				1131	<p>
				1132	One point, if you do this: in order to make it easy to differentiate
				1133	<code>TempReg</code>s carrying values from <code>TempReg</code>s
				1134	carrying V bit vectors, Valgrind prints the former as (for example)
				1135	<code>t28</code> and the latter as <code>q28</code>; the fact that
				1136	they carry the same number serves to indicate their relationship.
				1137	This is purely for the convenience of the human reader; the register
				1138	allocator and code generator don't regard them as different.
				1139
				1140
				1141	<h3>Translation into UCode</h3>
				1142
				1143	<code>VG_(disBB)</code> allocates a new <code>UCodeBlock</code> and
				1144	then uses <code>disInstr</code> to translate x86 instructions one at a
				1145	time into UCode, dumping the result in the <code>UCodeBlock</code>.
				1146	This goes on until a control-flow transfer instruction is encountered.
				1147
				1148	<p>
				1149	Despite the large size of <code>vg_to_ucode.c</code>, this translation
				1150	is really very simple. Each x86 instruction is translated entirely
				1151	independently of its neighbours, merrily allocating new
				1152	<code>TempReg</code>s as it goes. The idea is to have a simple
				1153	translator -- in reality, no more than a macro-expander -- and the
				1154	resulting bad UCode translation is cleaned up by the UCode
				1155	optimisation phase which follows. To give you an idea of some x86
				1156	instructions and their translations (this is a complete basic block,
				1157	as Valgrind sees it):
				1158	<pre>
				1159	0x40435A50: incl %edx
				1160
				1161	0: GETL %EDX, t0
				1162	1: INCL t0 (-wOSZAP)
				1163	2: PUTL t0, %EDX
				1164
				1165	0x40435A51: movsbl (%edx),%eax
				1166
				1167	3: GETL %EDX, t2
				1168	4: LDB (t2), t2
				1169	5: WIDENL_Bs t2
				1170	6: PUTL t2, %EAX
				1171
				1172	0x40435A54: testb $0x20, 1(%ecx,%eax,2)
				1173
				1174	7: GETL %EAX, t6
				1175	8: GETL %ECX, t8
				1176	9: LEA2L 1(t8,t6,2), t4
				1177	10: LDB (t4), t10
				1178	11: MOVB $0x20, t12
				1179	12: ANDB t12, t10 (-wOSZACP)
				1180	13: INCEIPo $9
				1181
				1182	0x40435A59: jnz-8 0x40435A50
				1183
				1184	14: Jnzo $0x40435A50 (-rOSZACP)
				1185	15: JMPo $0x40435A5B
				1186	</pre>
				1187
				1188	<p>
				1189	Notice how the block always ends with an unconditional jump to the
				1190	next block. This is a bit unnecessary, but makes many things simpler.
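<p>
The literals in the listing can be cross-checked with a little address arithmetic. The sketch below derives each x86 instruction's length from the start addresses shown. The table and function names are invented. That the <code>$9</code> in <code>INCEIPo $9</code> is the combined length of the first three instructions is an inference from the listing, not something this section states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Start addresses of the four x86 instructions in the listing, followed
   by the address just past the last one (the fall-through target). */
static const uint32_t insn_addr[] = {
    0x40435A50u,  /* incl   %edx                  */
    0x40435A51u,  /* movsbl (%edx),%eax           */
    0x40435A54u,  /* testb  $0x20, 1(%ecx,%eax,2) */
    0x40435A59u,  /* jnz-8  0x40435A50            */
    0x40435A5Bu   /* start of the next block      */
};

/* Byte length of instruction i, from consecutive start addresses. */
static uint32_t insn_len(size_t i)
{
    assert(i + 1 < sizeof insn_addr / sizeof insn_addr[0]);
    return insn_addr[i + 1] - insn_addr[i];
}

/* Total length of instructions first .. last-1. */
static uint32_t bytes_spanned(size_t first, size_t last)
{
    uint32_t total = 0;
    for (size_t i = first; i < last; i++)
        total += insn_len(i);
    return total;
}
```

The lengths come out as 1, 3 and 5 bytes, summing to 9, and the 2-byte <code>jnz</code> at <code>0x40435A59</code> ends at <code>0x40435A5B</code>, the target of the closing <code>JMPo</code>.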
				1191
				1192	<p>
				1193	Most x86 instructions turn into sequences of <code>GET</code>,
				1194	<code>PUT</code>, <code>LEA1</code>, <code>LEA2</code>,
				1195	<code>LOAD</code> and <code>STORE</code>. Some complicated ones
				1196	however rely on calling helper bits of code in
				1197	<code>vg_helpers.S</code>. The ucode instructions <code>PUSH</code>,
				1198	<code>POP</code>, <code>CALLM</code>, <code>CALLM_S</code> and
				1199	<code>CALLM_E</code> support this. The calling convention is somewhat
				1200	ad-hoc and is not the C calling convention. The helper routines must
				1201	save all integer registers, and the flags, that they use. Args are
				1202	passed on the stack underneath the return address, as usual, and if
				1203	result(s) are to be returned, it (they) are either placed in dummy arg
				1204	slots created by the ucode <code>PUSH</code> sequence, or just
				1205	overwrite the incoming args.
				1206
				1207	<p>
				1208	In order that the instrumentation mechanism can handle calls to these
				1209	helpers, <code>VG_(saneUCodeBlock)</code> enforces the following
				1210	restrictions on calls to helpers:
				1211
				1212	<ul>
				1213	<li>Each <code>CALL</code> uinstr must be bracketed by a preceding
				1214	<code>CALLM_S</code> marker (dummy uinstr) and a trailing
				1215	<code>CALLM_E</code> marker. These markers are used by the
				1216	instrumentation mechanism later to establish the boundaries of the
				1217	<code>PUSH</code>, <code>POP</code> and <code>CLEAR</code>
				1218	sequences for the call.
				1219	<p>
				1220	<li><code>PUSH</code>, <code>POP</code> and <code>CLEAR</code>
				1221	may only appear inside sections bracketed by <code>CALLM_S</code>
				1222	and <code>CALLM_E</code>, and nowhere else.
				1223	<p>
				1224	<li>In any such bracketed section, no two <code>PUSH</code> insns may
				1225	push the same <code>TempReg</code>. Dually, no two
				1226	<code>POP</code>s may pop the same <code>TempReg</code>.
				1227	<p>
				1228	<li>Finally, although this is not checked, args should be removed from
				1229	the stack with <code>CLEAR</code>, rather than <code>POP</code>s
				1230	into a <code>TempReg</code> which is not subsequently used. This
				1231	is because the instrumentation mechanism assumes that all values
				1232	<code>POP</code>ped from the stack are actually used.
				1233	</ul>
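<p>
The three checked restrictions can be modelled in a few lines. This is
a minimal Python sketch, not the C code of
<code>VG_(saneUCodeBlock)</code>: the name
<code>check_helper_calls</code>, the <code>(opcode, tempreg)</code>
tuple layout, and the rule that bracketed sections do not nest are all
assumptions made for illustration.

```python
def check_helper_calls(block):
    """Return True if `block` obeys the CALLM_S/CALLM_E bracketing rules.

    `block` is a list of (opcode, tempreg) pairs; tempreg is None for
    uinstrs that name no TempReg. This layout is hypothetical.
    """
    inside = False                  # are we between CALLM_S and CALLM_E?
    pushed, popped = set(), set()
    for opcode, treg in block:
        if opcode == "CALLM_S":
            if inside:
                return False        # assumption: sections do not nest
            inside, pushed, popped = True, set(), set()
        elif opcode == "CALLM_E":
            if not inside:
                return False        # end marker without a start marker
            inside = False
        elif opcode in ("PUSH", "POP", "CLEAR", "CALL"):
            if not inside:
                return False        # only legal inside a bracketed section
            if opcode == "PUSH":
                if treg in pushed:
                    return False    # same TempReg pushed twice
                pushed.add(treg)
            elif opcode == "POP":
                if treg in popped:
                    return False    # same TempReg popped twice
                popped.add(treg)
    return not inside               # reject an unterminated section
```

<p>
The fourth restriction (prefer <code>CLEAR</code> to an unused
<code>POP</code>) is not modelled, matching the text's remark that it
is not checked.
<p>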
				1234
				1235	Some of the translations may appear to have redundant
				1236	<code>TempReg</code>-to-<code>TempReg</code> moves. This helps the
				1237	next phase, UCode optimisation, to generate better code.
				1238
				1239
				1240
				1241	<h3>UCode optimisation</h3>
				1242
				1243	UCode is then subjected to an improvement pass
				1244	(<code>vg_improve()</code>), which blurs the boundaries between the
				1245	translations of the original x86 instructions. It's pretty
				1246	straightforward. Three transformations are done:
				1247
				1248	<ul>
				1249	<li>Redundant <code>GET</code> elimination. Actually, more general
				1250	than that -- eliminates redundant fetches of ArchRegs. In our
				1251	running example, uinstr 3 <code>GET</code>s <code>%EDX</code> into
				1252	<code>t2</code> despite the fact that, by looking at the previous
				1253	uinstr, it is already in <code>t0</code>. The <code>GET</code> is
				1254	therefore removed, and <code>t2</code> renamed to <code>t0</code>.
				1255	Assuming <code>t0</code> is allocated to a host register, it means
				1256	the simulated <code>%EDX</code> will exist in a host CPU register
				1257	for more than one simulated x86 instruction, which seems to me to
				1258	be a highly desirable property.
				1259	<p>
				1260	There is some mucking around to do with subregisters;
				1261	<code>%AL</code> vs <code>%AH</code> vs <code>%AX</code> vs
				1262	<code>%EAX</code> etc. I can't remember how it works, but in
				1263	general we are very conservative, and these tend to invalidate the
				1264	caching.
				1265	<p>
				1266	<li>Redundant <code>PUT</code> elimination. This annuls
				1267	<code>PUT</code>s of values back to simulated CPU registers if a
				1268	later <code>PUT</code> would overwrite the earlier
				1269	<code>PUT</code> value, and there are no intervening reads of the
				1270	simulated register (<code>ArchReg</code>).
				1271	<p>
				1272	As before, we are paranoid when faced with subregister references.
				1273	Also, <code>PUT</code>s of <code>%ESP</code> are never annulled,
				1274	because it is vital the instrumenter always has an up-to-date
				1275	<code>%ESP</code> value available: <code>%ESP</code> changes
				1276	affect addressability of the memory around the simulated stack
				1277	pointer.
				1278	<p>
				1279	The implication of the above paragraph is that the simulated
				1280	machine's registers are only lazily updated once the above two
				1281	optimisation phases have run, with the exception of
				1282	<code>%ESP</code>. <code>TempReg</code>s go dead at the end of
				1283	every basic block, from which it is inferable that any
				1284	<code>TempReg</code> caching a simulated CPU reg is flushed (back
				1285	into the relevant <code>VG_(baseBlock)</code> slot) at the end of
				1286	every basic block. The further implication is that the simulated
				1287	registers are only up-to-date in between basic blocks, and not
				1288	at arbitrary points inside basic blocks. And the consequence of
				1289	that is that we can only deliver signals to the client in between
				1290	basic blocks. None of this seems any problem in practice.
				1291	<p>
				1292	<li>Finally there is a simple def-use thing for condition codes. If
				1293	an earlier uinstr writes the condition codes, and the next uinsn
				1294	along which actually cares about the condition codes writes the
				1295	same or larger set of them, but does not read any, the earlier
				1296	uinsn is marked as not writing any condition codes. This saves
				1297	a lot of redundant cond-code saving and restoring.
				1298	</ul>
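<p>
Redundant <code>GET</code> elimination can be sketched as a single
forward pass. The Python below is an illustrative model only, not
<code>vg_improve()</code> itself: the function name, the
<code>(opcode, operands)</code> layout and the convention that the last
operand of an ordinary uinstr is the one written are assumptions, and
subregisters are ignored entirely. The <code>LEA2L</code> operand tuple
omits the displacement and scale.

```python
def eliminate_redundant_gets(block):
    """Drop GETs of ArchRegs whose value already sits in a TempReg.

    `block` is a list of (opcode, operands) pairs. Hypothetical layout:
    GET is (areg, treg), PUT is (treg, areg), and for every other uinstr
    the last operand is the one written.
    """
    cache = {}     # ArchReg -> TempReg currently holding its value
    rename = {}    # TempReg of a deleted GET -> TempReg used instead
    improved = []
    for opcode, operands in block:
        operands = tuple(rename.get(o, o) for o in operands)
        if opcode.startswith("GET"):
            areg, treg = operands
            if areg in cache:
                rename[treg] = cache[areg]    # delete the GET, reuse the copy
                continue
            cache[areg] = treg
        elif opcode.startswith("PUT"):
            treg, areg = operands
            cache[areg] = treg                # treg now mirrors the ArchReg
        elif operands:
            written = operands[-1]
            # any other write ends the TempReg's life as a cached copy
            cache = {a: t for a, t in cache.items() if t != written}
        improved.append((opcode, operands))
    return improved


# uinstrs 0 to 9 of the running example
block = [
    ("GETL", ("%EDX", "t0")), ("INCL", ("t0",)), ("PUTL", ("t0", "%EDX")),
    ("GETL", ("%EDX", "t2")), ("LDB", ("t2", "t2")), ("WIDENL_Bs", ("t2",)),
    ("PUTL", ("t2", "%EAX")), ("GETL", ("%EAX", "t6")),
    ("GETL", ("%ECX", "t8")), ("LEA2L", ("t8", "t6", "t4")),
]
for uinstr in eliminate_redundant_gets(block):
    print(uinstr)
```

<p>
On this input the model deletes the <code>GET</code>s at uinstrs 3 and
7 and renames <code>t2</code> and <code>t6</code> to <code>t0</code>.
<p>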
				1299
				1300	The effect of these transformations on our short block is rather
				1301	unexciting, and shown below. On longer basic blocks they can
				1302	dramatically improve code quality.
				1303
				1304	<pre>
				1305	at 3: delete GET, rename t2 to t0 in (4 .. 6)
				1306	at 7: delete GET, rename t6 to t0 in (8 .. 9)
				1307	at 1: annul flag write OSZAP due to later OSZACP
				1308
				1309	Improved code:
				1310	0: GETL %EDX, t0
				1311	1: INCL t0
				1312	2: PUTL t0, %EDX
				1313	4: LDB (t0), t0
				1314	5: WIDENL_Bs t0
				1315	6: PUTL t0, %EAX
				1316	8: GETL %ECX, t8
				1317	9: LEA2L 1(t8,t0,2), t4
				1318	10: LDB (t4), t10
				1319	11: MOVB $0x20, t12
				1320	12: ANDB t12, t10 (-wOSZACP)
				1321	13: INCEIPo $9
				1322	14: Jnzo $0x40435A50 (-rOSZACP)
				1323	15: JMPo $0x40435A5B
				1324	</pre>
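<p>
The condition-code rule that annuls the <code>OSZAP</code> write at
uinstr 1 can be modelled as a backwards scan. A hedged Python sketch
follows; the function name and the
<code>(name, flags_read, flags_written)</code> layout are invented for
illustration, and flag writes with no later flag-touching uinstr in the
block are conservatively kept, which is an assumption rather than
something the text states.

```python
def annul_dead_flag_writes(block):
    """Clear flag writes that are overwritten before anything reads them.

    `block` is a list of (name, flags_read, flags_written), where each
    flag set is a string of letters such as "OSZACP". Hypothetical layout.
    """
    result = []
    nxt = None     # (read, written) of the nearest later uinstr touching flags
    for name, rd, wr in reversed(block):
        rd, wr = set(rd), set(wr)
        if wr and nxt is not None:
            nxt_rd, nxt_wr = nxt
            # the next uinstr that cares reads nothing and rewrites all of ours
            if not nxt_rd and wr <= nxt_wr:
                wr = set()
        if rd or wr:
            nxt = (rd, wr)
        result.append((name, rd, wr))
    result.reverse()
    return result


block = [
    ("INCL", "", "OSZAP"),
    ("ANDB", "", "OSZACP"),
    ("Jnzo", "OSZACP", ""),
    ("JMPo", "", ""),
]
improved = annul_dead_flag_writes(block)
```

<p>
Here <code>INCL</code> loses its flag write because <code>ANDB</code>
rewrites a superset without reading any flags, while
<code>ANDB</code> keeps its write because <code>Jnzo</code> reads it.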
				1325
				1326	<h3>UCode instrumentation</h3>
				1327
				1328	Once you understand the meaning of the instrumentation uinstrs,
				1329	discussed in detail above, the instrumentation scheme is fairly
daywalker	7e73e5f	2003-07-04 16:18:15 +0000	[diff] [blame]	1330	straightforward. Each uinstr is instrumented in isolation, and the
sewardj	a9a2dcf	2002-11-11 00:20:07 +0000	[diff] [blame]	1331	instrumentation uinstrs are placed before the original uinstr.
				1332	Our running example continues below. I have placed a blank line
				1333	after every original ucode, to make it easier to see which
				1334	instrumentation uinstrs correspond to which originals.
				1335
				1336	<p>
				1337	As mentioned somewhere above, <code>TempReg</code>s carrying values
				1338	have names like <code>t28</code>, and each one has a shadow carrying
				1339	its V bits, with names like <code>q28</code>. This pairing aids in
				1340	reading instrumented ucode.
				1341
				1342	<p>
				1343	One decision about all this is where to have "observation points",
				1344	that is, where to check that V bits are valid. I use a minimalistic
				1345	scheme, only checking where a failure of validity could cause the
				1346	original program to (seg)fault. So the use of values as memory
				1347	addresses causes a check, as do conditional jumps (these cause a check
				1348	on the definedness of the condition codes). And arguments
daywalker	7e73e5f	2003-07-04 16:18:15 +0000	[diff] [blame]	1349	<code>PUSH</code>ed for helper calls are checked, hence the weird
sewardj	a9a2dcf	2002-11-11 00:20:07 +0000	[diff] [blame]	1350	restrictions on helper call preambles described above.
				1351
				1352	<p>
				1353	Another decision is that once a value is tested, it is thereafter
				1354	regarded as defined, so that we do not emit multiple undefined-value
				1355	errors for the same undefined value. That means that
				1356	<code>TESTV</code> uinstrs are always followed by <code>SETV</code>
				1357	on the same (shadow) <code>TempReg</code>s. Most of these
				1358	<code>SETV</code>s are redundant and are removed by the
				1359	post-instrumentation cleanup phase.
				1360
				1361	<p>
				1362	The instrumentation for calling helper functions deserves further
				1363	comment. The definedness of results from a helper is modelled using
				1364	just one V bit. So, in short, we do pessimising casts of the
				1365	definedness of all the args, down to a single bit, and then
				1366	<code>UifU</code> these bits together. So this single V bit will say
				1367	"undefined" if any part of any arg is undefined. This V bit is then
				1368	pessimally cast back up to the result(s) sizes, as needed. If, by
				1369	seeing that all the args are got rid of with <code>CLEAR</code> and
				1370	none with <code>POP</code>, Valgrind sees that the result of the call
				1371	is not actually used, it immediately examines the result V bit with a
				1372	<code>TESTV</code> -- <code>SETV</code> pair. If it did not do this,
				1373	there would be no observation point to detect that some of the
				1374	args to the helper were undefined. Of course, if the helper's results
				1375	are indeed used, we don't do this, since the result usage will
				1376	presumably cause the result definedness to be checked at some suitable
				1377	future point.
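<p>
The single-V-bit approximation can be written down directly. The
sketch below is a Python model with hypothetical function names; it
assumes the encoding in which a V bit of 1 means "undefined" and 0
means "defined", so that <code>UifU</code> is a bitwise OR.

```python
def pcast(vbits, to_width):
    """Pessimising cast: all-defined stays all-defined; anything else
    becomes all-undefined at the new width (1 bits mean 'undefined')."""
    return 0 if vbits == 0 else (1 << to_width) - 1


def uifu(a, b):
    """Undefined-if-either-Undefined: bitwise OR of the V bits."""
    return a | b


def helper_result_vbits(arg_vbits, result_widths):
    """Approximate a helper call: squash every arg to one V bit, merge
    the bits, then widen the summary bit to each result size."""
    summary = 0
    for v in arg_vbits:
        summary = uifu(summary, pcast(v, 1))
    return [pcast(summary, w) for w in result_widths]
```

<p>
With one undefined bit anywhere in any arg, every result comes back
wholly undefined; with fully defined args, every result is fully
defined.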
				1378
				1379	<p>
				1380	In general Valgrind tries to track definedness on a bit-for-bit basis,
				1381	but as the above para shows, for calls to helpers we throw in the
				1382	towel and approximate down to a single bit. This is because it's too
				1383	complex and difficult to track bit-level definedness through complex
				1384	ops such as integer multiply and divide, and in any case there are no
				1385	reasonable code fragments which attempt to (eg) multiply two
				1386	partially-defined values and end up with something meaningful, so
				1387	there seems little point in modelling multiplies, divides, etc, in
				1388	that level of detail.
				1389
				1390	<p>
				1391	Integer loads and stores are instrumented with firstly a test of the
				1392	definedness of the address, followed by a <code>LOADV</code> or
				1393	<code>STOREV</code> respectively. These turn into calls to
				1394	(for example) <code>VG_(helperc_LOADV4)</code>. These helpers do two
				1395	things: they perform an address-valid check, and they load or store V
				1396	bits from/to the relevant address in the (simulated V-bit) memory.
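<p>
The two jobs of such a helper can be illustrated with a toy shadow
memory. This Python class is a sketch under stated assumptions, not the
layout Valgrind uses: the names are hypothetical, A bits are modelled
as a set of byte addresses, V bits as one byte per address with 1
meaning "undefined", and an unaddressable load returns "defined" V bits
purely to keep the example simple.

```python
class ShadowMemory:
    def __init__(self):
        self.addressable = set()   # byte addresses the client may touch (A bits)
        self.vbits = {}            # byte address -> 8 V bits, 1 = undefined
        self.errors = []

    def make_writable(self, addr, size):
        """Mark bytes addressable but undefined, as for fresh heap memory."""
        for a in range(addr, addr + size):
            self.addressable.add(a)
            self.vbits[a] = 0xFF

    def loadv(self, addr, size):
        """Address-valid check, then fetch the V bits (little-endian)."""
        if any(a not in self.addressable for a in range(addr, addr + size)):
            self.errors.append(("invalid read", addr, size))
            return 0               # assumption: report once, then treat as defined
        result = 0
        for i in range(size):
            result |= self.vbits[addr + i] << (8 * i)
        return result

    def storev(self, addr, size, vbits):
        """Address-valid check, then store the V bits (little-endian)."""
        if any(a not in self.addressable for a in range(addr, addr + size)):
            self.errors.append(("invalid write", addr, size))
            return
        for i in range(size):
            self.vbits[addr + i] = (vbits >> (8 * i)) & 0xFF
```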
				1397
				1398	<p>
				1399	FPU loads and stores are different. As above the definedness of the
				1400	address is first tested. However, the helper routine for FPU loads
				1401	(<code>VGM_(fpu_read_check)</code>) emits an error if either the
				1402	address is invalid or the referenced area contains undefined values.
				1403	It has to do this because we do not simulate the FPU at all, and so
				1404	cannot track definedness of values loaded into it from memory, so we
				1405	have to check them as soon as they are loaded into the FPU, ie, at
				1406	this point. We notionally assume that everything in the FPU is
				1407	defined.
				1408
				1409	<p>
				1410	It follows therefore that FPU writes first check the definedness of
				1411	the address, then the validity of the address, and finally mark the
				1412	written bytes as well-defined.
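<p>
The FPU rules differ from the integer ones only in where the
definedness check happens, which a small model makes concrete. The
Python functions below are hypothetical stand-ins, not the real
helpers; they take the A bits as a set and the V bits as a dict, and
they leave out the inline test of the address's own definedness.

```python
def model_fpu_read(addressable, vbits, addr, size, errors):
    """FPU load: complain at once about a bad address or undefined data,
    because nothing tracks V bits once the value is inside the FPU."""
    span = range(addr, addr + size)
    if any(a not in addressable for a in span):
        errors.append(("invalid read", addr, size))
        return
    if any(vbits.get(a, 0xFF) != 0 for a in span):
        errors.append(("undefined value", addr, size))


def model_fpu_write(addressable, vbits, addr, size, errors):
    """FPU store: check the address, then mark the written bytes defined."""
    span = range(addr, addr + size)
    if any(a not in addressable for a in span):
        errors.append(("invalid write", addr, size))
        return
    for a in span:
        vbits[a] = 0x00            # everything in the FPU is assumed defined
```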
				1413
				1414	<p>
				1415	If anyone is inspired to extend Valgrind to MMX/SSE insns, I suggest
				1416	you use the same trick. It works provided that the FPU/MMX unit is
				1417	not used merely as a conduit to copy partially undefined data from
				1418	one place in memory to another. Unfortunately the integer CPU is used
				1419	like that (when copying C structs with holes, for example) and this is
				1420	the cause of much of the elaborateness of the instrumentation here
				1421	described.
				1422
				1423	<p>
				1424	<code>vg_instrument()</code> in <code>vg_translate.c</code> actually
				1425	does the instrumentation. There are comments explaining how each
				1426	uinstr is handled, so we do not repeat that here. As explained
				1427	already, it is bit-accurate, except for calls to helper functions.
				1428	Unfortunately the x86 insns <code>bt/bts/btc/btr</code> are done by
				1429	helper fns, so bit-level accuracy is lost there. This should be fixed
				1430	by doing them inline; it will probably require adding a couple of new
				1431	uinstrs. Also, left and right rotates through the carry flag (x86
				1432	<code>rcl</code> and <code>rcr</code>) are approximated via a single
				1433	V bit; so far this has not caused anyone to complain. The
				1434	non-carry rotates, <code>rol</code> and <code>ror</code>, are much
				1435	more common and are done exactly. Re-visiting the instrumentation for
				1436	AND and OR, they seem rather verbose, and I wonder if it could be done
				1437	more concisely now.
				1438
				1439	<p>
				1440	The lowercase <code>o</code> on many of the uopcodes in the running
				1441	example indicates that the size field is zero, usually meaning a
				1442	single-bit operation.
				1443
				1444	<p>
				1445	Anyroads, the post-instrumented version of our running example looks
				1446	like this:
				1447
				1448	<pre>
				1449	Instrumented code:
				1450	0: GETVL %EDX, q0
				1451	1: GETL %EDX, t0
				1452
				1453	2: TAG1o q0 = Left4 ( q0 )
				1454	3: INCL t0
				1455
				1456	4: PUTVL q0, %EDX
				1457	5: PUTL t0, %EDX
				1458
				1459	6: TESTVL q0
				1460	7: SETVL q0
				1461	8: LOADVB (t0), q0
				1462	9: LDB (t0), t0
				1463
				1464	10: TAG1o q0 = SWiden14 ( q0 )
				1465	11: WIDENL_Bs t0
				1466
				1467	12: PUTVL q0, %EAX
				1468	13: PUTL t0, %EAX
				1469
				1470	14: GETVL %ECX, q8
				1471	15: GETL %ECX, t8
				1472
				1473	16: MOVL q0, q4
				1474	17: SHLL $0x1, q4
				1475	18: TAG2o q4 = UifU4 ( q8, q4 )
				1476	19: TAG1o q4 = Left4 ( q4 )
				1477	20: LEA2L 1(t8,t0,2), t4
				1478
				1479	21: TESTVL q4
				1480	22: SETVL q4
				1481	23: LOADVB (t4), q10
				1482	24: LDB (t4), t10
				1483
				1484	25: SETVB q12
				1485	26: MOVB $0x20, t12
				1486
				1487	27: MOVL q10, q14
				1488	28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 )
				1489	29: TAG2o q10 = UifU1 ( q12, q10 )
				1490	30: TAG2o q10 = DifD1 ( q14, q10 )
				1491	31: MOVL q12, q14
				1492	32: TAG2o q14 = ImproveAND1_TQ ( t12, q14 )
				1493	33: TAG2o q10 = DifD1 ( q14, q10 )
				1494	34: MOVL q10, q16
				1495	35: TAG1o q16 = PCast10 ( q16 )
				1496	36: PUTVFo q16
				1497	37: ANDB t12, t10 (-wOSZACP)
				1498
				1499	38: INCEIPo $9
				1500
				1501	39: GETVFo q18
				1502	40: TESTVo q18
				1503	41: SETVo q18
				1504	42: Jnzo $0x40435A50 (-rOSZACP)
				1505
				1506	43: JMPo $0x40435A5B
				1507	</pre>
				1508
				1509
				1510	<h3>UCode post-instrumentation cleanup</h3>
				1511
				1512	<p>
				1513	This pass, coordinated by <code>vg_cleanup()</code>, removes redundant
				1514	definedness computation created by the simplistic instrumentation
				1515	pass. It consists of two passes,
				1516	<code>vg_propagate_definedness()</code> followed by
				1517	<code>vg_delete_redundant_SETVs()</code>.
				1518
				1519	<p>
				1520	<code>vg_propagate_definedness()</code> is a simple
				1521	constant-propagation and constant-folding pass. It tries to determine
				1522	which <code>TempReg</code>s containing V bits will always indicate
				1523	"fully defined", and it propagates this information as far as it can,
				1524	and folds out as many operations as possible. For example, the
				1525	instrumentation for an ADD of a literal to a variable quantity will be
				1526	reduced down so that the definedness of the result is simply the
				1527	definedness of the variable quantity, since the literal is by
				1528	definition fully defined.
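<p>
A cut-down model of this folding, covering only the case where one
<code>UifU</code> argument is known to be fully defined, might look
like the Python below. It is a sketch, not
<code>vg_propagate_definedness()</code>: the function name and the
<code>(op, sources, dest)</code> layout are assumptions, and only
<code>SETV</code>, <code>MOV</code> and <code>UifU</code> are
understood.

```python
def propagate_definedness(block):
    """Fold away V-bit operations whose inputs are known fully defined.

    `block` is a list of (op, sources, dest) using shadow TempReg names.
    This layout is hypothetical.
    """
    defined = set()            # shadow TempRegs known to be fully defined
    kept = []
    for op, sources, dest in block:
        if op == "SETV":
            defined.add(dest)
        elif op == "MOV":
            if sources[0] in defined:
                defined.add(dest)
            else:
                defined.discard(dest)
        elif op == "UifU" and sources[0] in defined and sources[1] == dest:
            continue           # dest = UifU(defined, dest) leaves dest unchanged
        else:
            defined.discard(dest)      # unknown op: assume nothing about dest
        kept.append((op, sources, dest))
    return kept
```

<p>
This mirrors the "delete UifU1 due to defd arg1" step of the running
example, where <code>q12</code> has just been set by a
<code>SETV</code>.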
				1529
				1530	<p>
				1531	<code>vg_delete_redundant_SETVs()</code> removes <code>SETV</code>s on
				1532	shadow <code>TempReg</code>s for which the next action is a write.
				1533	I don't think there's anything else worth saying about this; it is
				1534	simple. Read the sources for details.
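<p>
For readers who would rather not read the sources, the rule amounts to
dead-store elimination on shadow <code>TempReg</code>s. The Python
sketch below uses a hypothetical
<code>(opcode, reads, writes)</code> layout. It also treats "never
mentioned again before the end of the block" the same as "next action
is a write"; that is an assumption, though it agrees with the running
example, where the <code>SETV</code>s at 22 and 41 are deleted with no
later use of their <code>TempReg</code>s.

```python
def delete_redundant_setvs(block):
    """Remove SETVs whose target is overwritten (or dies) before any read.

    `block` is a list of (opcode, reads, writes); reads and writes are
    tuples of shadow TempReg names. This layout is hypothetical.
    """
    kept = []
    for i, (opcode, reads, writes) in enumerate(block):
        if opcode.startswith("SETV"):
            target = writes[0]
            redundant = True                 # dead unless a later read is found
            for _, later_reads, later_writes in block[i + 1:]:
                if target in later_reads:
                    redundant = False        # value is observed: keep the SETV
                    break
                if target in later_writes:
                    break                    # overwritten first: SETV is dead
            if redundant:
                continue
        kept.append((opcode, reads, writes))
    return kept
```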
				1535
				1536	<p>
				1537	So the cleaned-up running example looks like this. As above, I have
				1538	inserted line breaks after every original (non-instrumentation) uinstr
				1539	to aid readability. As with straightforward ucode optimisation, the
				1540	results in this block are undramatic because it is so short; longer
				1541	blocks benefit more because they have more redundancy which gets
				1542	eliminated.
				1543
				1544
				1545	<pre>
				1546	at 29: delete UifU1 due to defd arg1
				1547	at 32: change ImproveAND1_TQ to MOV due to defd arg2
				1548	at 41: delete SETV
				1549	at 31: delete MOV
				1550	at 25: delete SETV
				1551	at 22: delete SETV
				1552	at 7: delete SETV
				1553
				1554	0: GETVL %EDX, q0
				1555	1: GETL %EDX, t0
				1556
				1557	2: TAG1o q0 = Left4 ( q0 )
				1558	3: INCL t0
				1559
				1560	4: PUTVL q0, %EDX
				1561	5: PUTL t0, %EDX
				1562
				1563	6: TESTVL q0
				1564	8: LOADVB (t0), q0
				1565	9: LDB (t0), t0
				1566
				1567	10: TAG1o q0 = SWiden14 ( q0 )
				1568	11: WIDENL_Bs t0
				1569
				1570	12: PUTVL q0, %EAX
				1571	13: PUTL t0, %EAX
				1572
				1573	14: GETVL %ECX, q8
				1574	15: GETL %ECX, t8
				1575
				1576	16: MOVL q0, q4
				1577	17: SHLL $0x1, q4
				1578	18: TAG2o q4 = UifU4 ( q8, q4 )
				1579	19: TAG1o q4 = Left4 ( q4 )
				1580	20: LEA2L 1(t8,t0,2), t4
				1581
				1582	21: TESTVL q4
				1583	23: LOADVB (t4), q10
				1584	24: LDB (t4), t10
				1585
				1586	26: MOVB $0x20, t12
				1587
				1588	27: MOVL q10, q14
				1589	28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 )
				1590	30: TAG2o q10 = DifD1 ( q14, q10 )
				1591	32: MOVL t12, q14
				1592	33: TAG2o q10 = DifD1 ( q14, q10 )
				1593	34: MOVL q10, q16
				1594	35: TAG1o q16 = PCast10 ( q16 )
				1595	36: PUTVFo q16
				1596	37: ANDB t12, t10 (-wOSZACP)
				1597
				1598	38: INCEIPo $9
				1599	39: GETVFo q18
				1600	40: TESTVo q18
				1601	42: Jnzo $0x40435A50 (-rOSZACP)
				1602
				1603	43: JMPo $0x40435A5B
				1604	</pre>
				1605
				1606
				1607	<h3>Translation from UCode</h3>
				1608
				1609	This is all very simple, even though <code>vg_from_ucode.c</code>
				1610	is a big file. Position-independent x86 code is generated into
				1611	a dynamically allocated array <code>emitted_code</code>; this is
				1612	doubled in size when it overflows. Eventually the array is handed
				1613	back to the caller of <code>VG_(translate)</code>, who must copy
				1614	the result into TC and TT, and free the array.
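<p>
The doubling strategy for <code>emitted_code</code> is the usual
amortised-growth buffer. A minimal Python model, assuming nothing about
the real C code beyond what is said above (the class and method names
are made up for the illustration):

```python
class EmittedCode:
    """Byte buffer that doubles its capacity whenever it overflows."""

    def __init__(self, initial_capacity=8):
        if initial_capacity < 1:
            raise ValueError("capacity must be positive")
        self.buf = bytearray(initial_capacity)
        self.used = 0

    def emit_byte(self, b):
        if self.used == len(self.buf):
            bigger = bytearray(2 * len(self.buf))   # double on overflow
            bigger[:self.used] = self.buf           # copy what was emitted so far
            self.buf = bigger
        self.buf[self.used] = b & 0xFF
        self.used += 1

    def finish(self):
        """Hand back exactly the bytes emitted; the caller copies them on."""
        return bytes(self.buf[:self.used])
```

<p>
Doubling keeps the total copying cost proportional to the final size of
the translation, however many bytes are emitted one at a time.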
				1615
				1616	<p>
				1617	This file is structured into four layers of abstraction, which,
				1618	thankfully, are glued back together with extensive
				1619	<code>__inline__</code> directives. From the bottom upwards:
				1620
				1621	<ul>
				1622	<li>Address-mode emitters, <code>emit_amode_regmem_reg</code> et al.
				1623	<p>
				1624	<li>Emitters for specific x86 instructions. There are quite a lot of
				1625	these, with names such as <code>emit_movv_offregmem_reg</code>.
				1626	The <code>v</code> suffix is Intel parlance for a 16/32 bit insn;
				1627	there are also <code>b</code> suffixes for 8 bit insns.
				1628	<p>
				1629	<li>The next level up are the <code>synth_*</code> functions, which
				1630	synthesise possibly a sequence of raw x86 instructions to do some
				1631	simple task. Some of these are quite complex because they have to
				1632	work around Intel's silly restrictions on subregister naming. See
				1633	<code>synth_nonshiftop_reg_reg</code> for example.
				1634	<p>
				1635	<li>Finally, at the top of the heap, we have
				1636	<code>emitUInstr()</code>,
				1637	which emits code for a single uinstr.
				1638	</ul>
				1639
				1640	<p>
				1641	Some comments:
				1642	<ul>
				1643	<li>The hack for FPU instructions becomes apparent here. To do a
				1644	<code>FPU</code> ucode instruction, we load the simulated FPU's
				1645	state from <code>VG_(baseBlock)</code> into the real FPU
				1646	using an x86 <code>frstor</code> insn, do the ucode
				1647	<code>FPU</code> insn on the real CPU, and write the updated FPU
				1648	state back into <code>VG_(baseBlock)</code> using an
				1649	<code>fnsave</code> instruction. This is pretty brutal, but is
				1650	simple and it works, and even seems tolerably efficient. There is
				1651	no attempt to cache the simulated FPU state in the real FPU over
				1652	multiple back-to-back ucode FPU instructions.
				1653	<p>
				1654	<code>FPU_R</code> and <code>FPU_W</code> are also done this way,
				1655	with the minor complication that we need to patch in some
				1656	addressing mode bits so the resulting insn knows the effective
				1657	address to use. This is easy because of the regularity of the x86
				1658	FPU instruction encodings.
				1659	<p>
				1660	<li>An analogous trick is done with ucode insns which claim, in their
				1661	<code>flags_r</code> and <code>flags_w</code> fields, that they
				1662	read or write the simulated <code>%EFLAGS</code>. For such cases
				1663	we first copy the simulated <code>%EFLAGS</code> into the real
				1664	<code>%eflags</code>, then do the insn, then, if the insn says it
				1665	writes the flags, copy back to <code>%EFLAGS</code>. This is a
				1666	bit expensive, which is why the ucode optimisation pass goes to
				1667	some effort to remove redundant flag-update annotations.
				1668	</ul>
				1669
				1670	<p>
				1671	And so ... that's the end of the documentation for the instrumenting
				1672	translator! It's really not that complex, because it's composed as a
				1673	sequence of simple(ish) self-contained transformations on
				1674	straight-line blocks of code.
				1675
				1676
				1677	<h3>Top-level dispatch loop</h3>
				1678
				1679	Urk. In <code>VG_(toploop)</code>. This is basically boring and
				1680	unsurprising, not to mention fiddly and fragile. It needs to be
				1681	cleaned up.
				1682
				1683	<p>
				1684	Perhaps the only surprise is that the whole thing is run
				1685	on top of a <code>setjmp</code>-installed exception handler, because,
				1686	supposing a translation got a segfault, we have to bail out of the
				1687	Valgrind-supplied exception handler <code>VG_(oursignalhandler)</code>
				1688	and immediately start running the client's segfault handler, if it has
				1689	one. In particular we can't finish the current basic block and then
				1690	deliver the signal at some convenient future point, because signals
				1691	like SIGILL, SIGSEGV and SIGBUS mean that the faulting insn should not
				1692	simply be re-tried. (I'm sure there is a clearer way to explain this).
				1693
				1694
				1695	<h3>Exceptions, creating new translations</h3>
				1696	<h3>Self-modifying code</h3>
				1697
				1698	<h3>Lazy updates of the simulated program counter</h3>
				1699
				1700	Simulated <code>%EIP</code> is not updated after every simulated x86
				1701	insn as this was regarded as too expensive. Instead ucode
				1702	<code>INCEIP</code> insns move it along as and when necessary.
				1703	Currently we don't allow it to fall more than 4 bytes behind reality
				1704	(see <code>VG_(disBB)</code> for the way this works).
				1705	<p>
				1706	Note that <code>%EIP</code> is always brought up to date by the inner
				1707	dispatch loop in <code>VG_(dispatch)</code>, so that if the client
				1708	takes a fault we know at least which basic block this happened in.
				1709
				1710
				1711	<h3>The translation cache and translation table</h3>
				1712
				1713	<h3>Signals</h3>
				1714
				1715	Horrible, horrible. <code>vg_signals.c</code>.
				1716	Basically, since we have to intercept all system
				1717	calls anyway, we can see when the client tries to install a signal
				1718	handler. If it does so, we make a note of what the client asked to
				1719	happen, and ask the kernel to route the signal to our own signal
				1720	handler, <code>VG_(oursignalhandler)</code>. This simply notes the
				1721	delivery of signals, and returns.
				1722
				1723	<p>
				1724	Every 1000 basic blocks, we see if more signals have arrived. If so,
				1725	<code>VG_(deliver_signals)</code> builds signal delivery frames on the
				1726	client's stack, and allows their handlers to be run. Valgrind places
				1727	in these signal delivery frames a bogus return address,
njn	3e87f7e	2003-04-08 11:08:45 +0000	[diff] [blame]	1728	<code>VG_(signalreturn_bogusRA)</code>, and checks all jumps to see
sewardj	a9a2dcf	2002-11-11 00:20:07 +0000	[diff] [blame]	1729	if any jump to it. Such a jump is a sign that a signal handler is
				1730	returning, in which case Valgrind removes the relevant signal frame from
				1731	the client's stack, restores from the signal frame the simulated
				1732	state before the signal was delivered, and allows the client to run
				1733	onwards. We have to do it this way because some signal handlers never
				1734	return; they just <code>longjmp()</code>, which nukes the signal
				1735	delivery frame.
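<p>
The note-now, deliver-later scheme can be modelled without any real
signals. The Python class below is a deliberately small sketch with
hypothetical names: it records what the client asked for, queues
arrivals, and polls every 1000 basic blocks. Signal frames and the
bogus return address are left out.

```python
POLL_INTERVAL = 1000   # basic blocks between checks, as described above


class SignalSim:
    def __init__(self):
        self.client_handlers = {}   # signo -> handler the client asked for
        self.pending = []           # signals noted but not yet delivered
        self.delivered = []         # (signo, basic block count at delivery)

    def client_installs_handler(self, signo, handler):
        # note the request; the kernel would be given our own handler instead
        self.client_handlers[signo] = handler

    def our_signal_handler(self, signo):
        self.pending.append(signo)  # just note the delivery and return

    def deliver_signals(self, bb):
        while self.pending:
            signo = self.pending.pop(0)
            handler = self.client_handlers.get(signo)
            if handler is not None:
                handler(signo)
                self.delivered.append((signo, bb))

    def run(self, n_blocks, arrivals):
        """Run n_blocks basic blocks; `arrivals` maps block number -> signo."""
        for bb in range(1, n_blocks + 1):
            if bb in arrivals:
                self.our_signal_handler(arrivals[bb])
            if bb % POLL_INTERVAL == 0:
                self.deliver_signals(bb)
```

<p>
A signal that arrives during block 5 is therefore not seen by the
client until block 1000 has finished.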
				1736
				1737	<p>
				1738	The Linux kernel has a different but equally horrible hack for
				1739	detecting signal handler returns. Discovering it is left as an
				1740	exercise for the reader.
				1741
				1742
				1743
				1744	<h3>Errors, error contexts, error reporting, suppressions</h3>
				1745	<h3>Client malloc/free</h3>
				1746	<h3>Low-level memory management</h3>
				1747	<h3>A and V bitmaps</h3>
				1748	<h3>Symbol table management</h3>
				1749	<h3>Dealing with system calls</h3>
				1750	<h3>Namespace management</h3>
				1751	<h3>GDB attaching</h3>
				1752	<h3>Non-dependence on glibc or anything else</h3>
				1753	<h3>The leak detector</h3>
				1754	<h3>Performance problems</h3>
				1755	<h3>Continuous sanity checking</h3>
				1756	<h3>Tracing, or not tracing, child processes</h3>
				1757	<h3>Assembly glue for syscalls</h3>
				1758
				1759
				1760	<hr width="100%">
				1761
				1762	<h2>Extensions</h2>
				1763
				1764	Some comments about Stuff To Do.
				1765
				1766	<h3>Bugs</h3>
				1767
				1768	Stephan Kulow and Marc Mutz report problems with kmail in KDE 3 CVS
				1769	(RC2 ish) when run on Valgrind. Stephan has it deadlocking; Marc has
				1770	it looping at startup. I can't repro either behaviour. Needs
				1771	repro-ing and fixing.
				1772
				1773
				1774	<h3>Threads</h3>
				1775
				1776	Doing a good job of thread support strikes me as almost a
				1777	research-level problem. The central issues are how to do fast cheap
				1778	locking of the <code>VG_(primary_map)</code> structure, whether or not
				1779	accesses to the individual secondary maps need locking, what
				1780	race-condition issues result, and whether the already-nasty mess that
				1781	is the signal simulator needs further hackery.
				1782
				1783	<p>
				1784	I realise that threads are the most-frequently-requested feature, and
				1785	I am thinking about it all. If you have guru-level understanding of
				1786	fast mutual exclusion mechanisms and race conditions, I would be
				1787	interested in hearing from you.
				1788
				1789
				1790	<h3>Verification suite</h3>
				1791
				1792	Directory <code>tests/</code> contains various ad-hoc tests for
				1793	Valgrind. However, there is no systematic verification or regression
				1794	suite that, for example, exercises all the stuff in
				1795	<code>vg_memory.c</code>, to ensure that illegal memory accesses and
				1796	undefined value uses are detected as they should be. It would be good
				1797	to have such a suite.
				1798
				1799
				1800	<h3>Porting to other platforms</h3>
				1801
				1802	It would be great if Valgrind was ported to FreeBSD and x86 NetBSD,
				1803	and to x86 OpenBSD, if it's possible (doesn't OpenBSD use a.out-style
				1804	executables, not ELF?)
				1805
				1806	<p>
				1807	The main difficulties, for an x86-ELF platform, seem to be:
				1808
				1809	<ul>
				1810	<li>You'd need to rewrite the <code>/proc/self/maps</code> parser
				1811	(<code>vg_procselfmaps.c</code>).
				1812	Easy.
				1813	<p>
				1814	<li>You'd need to rewrite <code>vg_syscall_mem.c</code>, or, more
				1815	specifically, provide one for your OS. This is tedious, but you
				1816	can implement syscalls on demand, and the Linux kernel interface
				1817	is, for the most part, going to look very similar to the *BSD
				1818	interfaces, so it's really a copy-paste-and-modify-on-demand job.
				1819	As part of this, you'd need to supply a new
				1820	<code>vg_kerneliface.h</code> file.
				1821	<p>
				1822	<li>You'd also need to change the syscall wrappers for Valgrind's
				1823	internal use, in <code>vg_mylibc.c</code>.
				1824	</ul>
				1825
				1826	All in all, I think a port to x86-ELF *BSDs is not really very
				1827	difficult, and in some ways I would like to see it happen, because
				1828	that would force a clearer factoring of Valgrind into platform
				1829	dependent and independent pieces. Not to mention, *BSD folks also
				1830	deserve to use Valgrind just as much as the Linux crew do.
				1831
				1832
				1833	<p>
				1834	<hr width="100%">
				1835
				1836	<h2>Easy stuff which ought to be done</h2>
				1837
				1838	<h3>MMX instructions</h3>
				1839
				1840	MMX insns should be supported, using the same trick as for FPU insns.
				1841	If the MMX registers are not used to copy uninitialised junk from one
				1842	place to another in memory, this means we don't have to actually
				1843	simulate the internal MMX unit state, so the FPU hack applies. This
				1844	should be fairly easy.
				1845
				1846
				1847
				1848	<h3>Fix stabs-info reader</h3>
				1849
				1850	The machinery in <code>vg_symtab2.c</code> which reads "stabs" style
				1851	debugging info is pretty weak. It usually correctly translates
				1852	simulated program counter values into line numbers and procedure
				1853	names, but the file name is often completely wrong. I think the
				1854	logic used to parse "stabs" entries is weak. It should be fixed.
				1855	The simplest solution, IMO, is to copy either the logic or simply the
				1856	code out of GNU binutils which does this; since GDB can clearly get it
				1857	right, binutils (or GDB?) must have code to do this somewhere.
				1858
				1859
				1860
				1861
				1862
				1863	<h3>BT/BTC/BTS/BTR</h3>
				1864
				1865	These are x86 instructions which test, complement, set, or reset, a
				1866	single bit in a word. At the moment they are both incorrectly
				1867	implemented and incorrectly instrumented.
				1868
				1869	<p>
				1870	The incorrect instrumentation is due to use of helper functions. This
				1871	means we lose bit-level definedness tracking, which could wind up
				1872	giving spurious uninitialised-value use errors. The Right Thing to do
				1873	is to invent a couple of new UOpcodes, I think <code>GET_BIT</code>
				1874	and <code>SET_BIT</code>, which can be used to implement all 4 x86
				1875	insns, get rid of the helpers, and give bit-accurate instrumentation
				1876	rules for the two new UOpcodes.
				1877
				1878	<p>
				1879	I realised the other day that they are mis-implemented too. The x86
				1880	insns take a bit-index and a register or memory location to access.
				1881	For registers the bit index clearly can only be in the range zero to
				1882	register-width minus 1, and I assumed the same applied to memory
				1883	locations too. But evidently not; for memory locations the index can
				1884	be arbitrary, and the processor will index arbitrarily into memory as
				1885	a result. This too should be fixed. Sigh. Presumably indexing
				1886	outside the immediate word is not actually used by any programs yet
				1887	tested on Valgrind, for otherwise they (presumably) would simply not
				1888	work at all. If you plan to hack on this, first check the Intel docs
				1889	to make sure my understanding is really correct.
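
<p>
For anyone checking this against the Intel docs, the following C sketch
models the addressing as I read it for a 32-bit operand with the bit
index in a register: the index is signed, the dword accessed lies at
the base address plus 4 times the index divided by 32 (rounded towards
minus infinity), and the bit tested is the index modulo 32. The
function names are made up for the example, and the rounding rule in
particular should be confirmed against the manual.

```c
#include <stdint.h>

/* Split a signed bit offset into a dword index (rounded toward minus
   infinity) and a bit position in 0..31. */
void bt_locate(int32_t bit_offset, int32_t *dword_index, unsigned *bit_pos)
{
    int32_t q = bit_offset / 32;   /* C division truncates toward zero */
    int32_t r = bit_offset % 32;
    if (r < 0) {                   /* fix up negative offsets */
        r += 32;
        q -= 1;
    }
    *dword_index = q;
    *bit_pos = (unsigned)r;
}

/* Model of the value BT would leave in the carry flag for a memory
   operand at base.  The caller must ensure base[idx] is valid. */
unsigned bt_mem(const uint32_t *base, int32_t bit_offset)
{
    int32_t idx;
    unsigned pos;
    bt_locate(bit_offset, &idx, &pos);
    return (base[idx] >> pos) & 1u;
}
```

So an index of 35 tests bit 3 of the dword after the one addressed, and
an index of -1 tests bit 31 of the dword before it.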
				1890
				1891
				1892
				1893	<h3>Using PREFETCH instructions</h3>
				1894
				1895	Here's a small but potentially interesting project for performance
				1896	junkies. Experiments with valgrind's code generator and optimiser(s)
				1897	suggest that reducing the number of instructions executed in the
				1898	translations and mem-check helpers gives disappointingly small
				1899	performance improvements. Perhaps this is because performance of
				1900	Valgrindified code is limited by cache misses. After all, each read
				1901	in the original program now gives rise to at least three reads, one
				1902	for the <code>VG_(primary_map)</code>, one for the resulting
				1903	secondary, and the original. Not to mention, the instrumented
				1904	translations are 13 to 14 times larger than the originals. All in all
				1905	one would expect the memory system to be hammered to hell and then
				1906	some.
				1907
				1908	<p>
				1909	So here's an idea. An x86 insn involving a read from memory, after
				1910	instrumentation, will turn into ucode of the following form:
				1911	<pre>
				1912	... calculate effective addr, into ta and qa ...
				1913	TESTVL qa -- is the addr defined?
				1914	LOADV (ta), qloaded -- fetch V bits for the addr
				1915	LOAD (ta), tloaded -- do the original load
				1916	</pre>
				1917	At the point where the <code>LOADV</code> is done, we know the actual
				1918	address (<code>ta</code>) from which the real <code>LOAD</code> will
				1919	be done. We also know that the <code>LOADV</code> will take around
				1920	20 x86 insns to do. So it seems plausible that doing a prefetch of
				1921	<code>ta</code> just before the <code>LOADV</code> might just avoid a
				1922	miss at the <code>LOAD</code> point, and that might be a significant
				1923	performance win.
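
<p>
The idea can be mimicked at the C level as below. This is only an
analogy: the real change would be made in the code generator, which
would emit a prefetch insn ahead of the <code>LOADV</code> helper call.
The two-level map here is a toy over a 4096-byte simulated address
space, all names are hypothetical, and <code>__builtin_prefetch</code>
is the gcc builtin (compiled to nothing on other compilers).

```c
#include <stdint.h>

#define SEC_BITS   8u
#define SEC_SIZE   (1u << SEC_BITS)   /* bytes covered by one secondary */
#define N_PRIMARY  16u                /* 16 * 256 = 4096 simulated bytes */

typedef struct { uint8_t vbyte[SEC_SIZE]; } Secondary;

typedef struct {
    Secondary *primary_map[N_PRIMARY];        /* toy primary map */
    uint8_t    mem[N_PRIMARY * SEC_SIZE];     /* simulated client memory */
} Sim;

#if defined(__GNUC__)
#  define PREFETCH_FOR_READ(p) __builtin_prefetch((p), 0, 3)
#else
#  define PREFETCH_FOR_READ(p) ((void)(p))
#endif

/* Load one byte at simulated address ta and return its V byte through
   *vout.  ta must be below N_PRIMARY * SEC_SIZE. */
uint8_t checked_load8(const Sim *s, uint32_t ta, uint8_t *vout)
{
    /* Hint issued as soon as ta is known, before the shadow lookup. */
    PREFETCH_FOR_READ(&s->mem[ta]);

    const Secondary *sec = s->primary_map[ta >> SEC_BITS];  /* read 1 */
    *vout = sec->vbyte[ta & (SEC_SIZE - 1u)];               /* read 2: LOADV */
    return s->mem[ta];                                      /* read 3: LOAD */
}
```

Whether the hint arrives early enough to hide any of the miss is exactly
the open question.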
				1924
				1925	<p>
				1926	Prefetch insns are notoriously temperamental, more often than not
				1927	making things worse rather than better, so this would require
				1928	considerable fiddling around. It's complicated because Intels and
				1929	AMDs have different prefetch insns with different semantics, so that
				1930	too needs to be taken into account. As a general rule, even placing
				1931	the prefetches before the <code>LOADV</code> insn is too near the
				1932	<code>LOAD</code>; the ideal distance is apparently circa 200 CPU
				1933	cycles. So it might be worth having another analysis/transformation
				1934	pass which pushes prefetches as far back as possible, hopefully
				1935	immediately after the effective address becomes available.
				1936
				1937	<p>
				1938	Doing too many prefetches is also bad because they soak up bus
				1939	bandwidth / cpu resources, so some cleverness in deciding which loads
				1940	to prefetch and which to not might be helpful. One can imagine not
				1941	prefetching client-stack-relative (<code>%EBP</code> or
				1942	<code>%ESP</code>) accesses, since the stack in general tends to show
				1943	good locality anyway.
				1944
				1945	<p>
				1946	There's quite a lot of experimentation to do here, but I think it
				1947	might make an interesting week's work for someone.
				1948
				1949	<p>
				1950	As of 15-ish March 2002, I've started to experiment with this, using
				1951	the AMD <code>prefetch/prefetchw</code> insns.
				1952
				1953
				1954
				1955	<h3>User-defined permission ranges</h3>
				1956
				1957	This is quite a large project -- perhaps a month's hacking for a
				1958	capable hacker to do a good job -- but it's potentially very
				1959	interesting. The outcome would be that Valgrind could detect a
				1960	whole class of bugs which it currently cannot.
				1961
				1962	<p>
				1963	The presentation falls into two pieces.
				1964
				1965	<p>
				1966	<b>Part 1: user-defined address-range permission setting</b>
				1967	<p>
				1968
				1969	Valgrind intercepts the client's <code>malloc</code>,
				1970	<code>free</code>, etc calls, watches system calls, and watches the
				1971	stack pointer move. This is currently the only way it knows about
				1972	which addresses are valid and which not. Sometimes the client program
				1973	knows extra information about its memory areas. For example, the
				1974	client could at some point know that all elements of an array are
				1975	out-of-date. We would like to be able to convey to Valgrind this
				1976	information that the array is now addressable-but-uninitialised, so
				1977	that Valgrind can then warn if elements are used before they get new
				1978	values.
				1979
				1980	<p>
				1981	What I would like are some macros like this:
				1982	<pre>
				1983	VALGRIND_MAKE_NOACCESS(addr, len)
				1984	VALGRIND_MAKE_WRITABLE(addr, len)
				1985	VALGRIND_MAKE_READABLE(addr, len)
				1986	</pre>
				1987	and also, to check that memory is addressable/initialised,
				1988	<pre>
				1989	VALGRIND_CHECK_ADDRESSIBLE(addr, len)
				1990	VALGRIND_CHECK_INITIALISED(addr, len)
				1991	</pre>
				1992
				1993	<p>
				1994	I then include in my sources a header defining these macros, rebuild
				1995	my app, run under Valgrind, and get user-defined checks.
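
<p>
To pin down the intended meaning of the five macros, here is a toy C
model of the per-byte state they would manipulate. It assumes, as my
reading of the names, that <code>MAKE_WRITABLE</code> means
addressable-but-uninitialised and <code>MAKE_READABLE</code> means
addressable-and-initialised. The type and function names are invented
for the sketch, and it works on offsets into a small shadow array
rather than on real addresses.

```c
#include <stddef.h>

typedef enum {
    P_NOACCESS,   /* not addressable                  */
    P_WRITABLE,   /* addressable but uninitialised    */
    P_READABLE    /* addressable and initialised      */
} Perm;

/* Model of the three MAKE_* requests: set the state of len bytes
   starting at offset off. */
void set_range(Perm *shadow, size_t off, size_t len, Perm p)
{
    for (size_t i = 0; i < len; i++)
        shadow[off + i] = p;
}

/* Model of CHECK_ADDRESSIBLE: offset of the first unaddressable byte
   in the range, or -1 if the whole range is addressable. */
long check_addressable(const Perm *shadow, size_t off, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (shadow[off + i] == P_NOACCESS)
            return (long)(off + i);
    return -1;
}

/* Model of CHECK_INITIALISED: offset of the first byte that is not
   both addressable and initialised, or -1 if there is none. */
long check_initialised(const Perm *shadow, size_t off, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (shadow[off + i] != P_READABLE)
            return (long)(off + i);
    return -1;
}
```

In the out-of-date array example, the client would mark the array
writable, and a later initialisation check on any element that had not
been rewritten would report it.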
				1996
				1997	<p>
				1998	Now here's a neat trick. It's a nuisance to have to re-link the app
				1999	with some new library which implements the above macros. So the idea
				2000	is to define the macros so that the resulting executable is still
				2001	completely stand-alone, and can be run without Valgrind, in which case
				2002	the macros do nothing, but when run on Valgrind, the Right Thing
				2003	happens. How to do this? The idea is for these macros to turn into a
				2004	piece of inline assembly code, which (1) has no effect when run on the
				2005	real CPU, (2) is easily spotted by Valgrind's JITter, and (3) no sane
				2006	person would ever write, which is important for avoiding false matches
				2007	in (2). So here's a suggestion:
				2008	<pre>
				2009	VALGRIND_MAKE_NOACCESS(addr, len)
				2010	</pre>
				2011	becomes (roughly speaking)
				2012	<pre>
				2013	movl addr, %eax
				2014	movl len, %ebx
				2015	movl $1, %ecx -- 1 describes the action; MAKE_WRITABLE might be
				2016	-- 2, etc
				2017	rorl $13, %ecx
				2018	rorl $19, %ecx
				2019	rorl $11, %eax
				2020	rorl $21, %eax
				2021	</pre>
				2022	The rotate sequences leave the register values unchanged, since each
				2023	pair rotates by 32 bits in total, and it's unlikely they would appear
				2024	for any other reason, but they define a unique byte-sequence which the
				2025	JITter can easily spot. Using the clobber list at the end of a gcc
				2026	inline-assembly statement, we can tell gcc that the assembly fragment
				2027	kills <code>%eax</code>, <code>%ebx</code>, <code>%ecx</code> and the
				2028	condition codes, so this fragment is harmless and runs quickly when
				2029	not running on Valgrind, and does not require any other library support.
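
<p>
A hedged sketch of how such a macro might be written with gcc inline
assembly follows. It differs slightly from the fragment above: rather
than writing the <code>movl</code>s by hand and listing the registers as
clobbered, it lets gcc load the registers through operand constraints
and declares <code>%eax</code> and <code>%ecx</code> as read-write. The
names <code>magic_request</code> and <code>SKETCH_MAKE_NOACCESS</code>
and the request codes other than 1 are hypothetical. The assembly is
only compiled for 32-bit x86; elsewhere the function does nothing.
<code>rotr32</code> is there just to show that each rotate pair is the
identity.

```c
#include <stdint.h>
#include <stddef.h>

/* Rotate a 32-bit value right by n bits, 0 < n < 32. */
uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* Hypothetical request codes; the text only suggests 1 and 2. */
enum { REQ_MAKE_NOACCESS = 1, REQ_MAKE_WRITABLE = 2, REQ_MAKE_READABLE = 3 };

void magic_request(unsigned action, const void *addr, size_t len)
{
#if defined(__GNUC__) && defined(__i386__)
    uint32_t a = (uint32_t)(uintptr_t)addr;
    uint32_t c = action;
    __asm__ volatile (
        "rorl $13, %1\n\t"        /* 13 + 19 = 32: %ecx unchanged */
        "rorl $19, %1\n\t"
        "rorl $11, %0\n\t"        /* 11 + 21 = 32: %eax unchanged */
        "rorl $21, %0"
        : "+a" (a), "+c" (c)      /* %eax, %ecx: read and rewritten */
        : "b" ((uint32_t)len)     /* %ebx: read only */
        : "cc", "memory");        /* rotates alter the flags */
#else
    (void)action; (void)addr; (void)len;   /* no-op on other targets */
#endif
}

#define SKETCH_MAKE_NOACCESS(addr, len) \
    magic_request(REQ_MAKE_NOACCESS, (addr), (len))
```

One caveat: older gcc versions reserve <code>%ebx</code> when building
position-independent code on x86 and may reject the <code>"b"</code>
constraint there.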
				2030
				2031
				2032	<p>
				2033	<b>Part 2: using it to detect interference between stack variables</b>
				2034	<p>
				2035
				2036	Currently Valgrind cannot detect errors of the following form:
				2037	<pre>
				2038	void fooble ( void )
				2039	{
				2040	   int a[10];
				2041	   int b[10];
				2042	   a[10] = 99;
				2043	}
				2044	</pre>
				2045	Now imagine rewriting this as
				2046	<pre>
				2047	void fooble ( void )
				2048	{
				2049	   int spacer0;
				2050	   int a[10];
				2051	   int spacer1;
				2052	   int b[10];
				2053	   int spacer2;
njn	3e87f7e	2003-04-08 11:08:45 +0000	[diff] [blame]	2054	   VALGRIND_MAKE_NOACCESS(&spacer0, sizeof(int));
				2055	   VALGRIND_MAKE_NOACCESS(&spacer1, sizeof(int));
				2056	   VALGRIND_MAKE_NOACCESS(&spacer2, sizeof(int));
sewardj	a9a2dcf	2002-11-11 00:20:07 +0000	[diff] [blame]	2057	   a[10] = 99;
				2058	}
				2059	</pre>
				2060	Now, provided the compiler lays the locals out in declaration order, the invalid write is certain to hit <code>spacer0</code> or
				2061	<code>spacer1</code>, so Valgrind will spot the error.
				2062
				2063	<p>
				2064	There are two complications.
				2065
				2066	<p>
				2067	The first is that we don't want to annotate sources by hand, so the
				2068	Right Thing to do is to write a C/C++ parser, annotator, prettyprinter
				2069	which does this automatically, and run it on post-CPP'd C/C++ source.
				2070	See http://www.cacheprof.org for an example of a system which
				2071	transparently inserts another phase into the gcc/g++ compilation
				2072	route. The parser/prettyprinter is probably not as hard as it sounds;
				2073	I would write it in Haskell, a powerful functional language well
				2074	suited to doing symbolic computation, with which I am intimately
				2075	familiar. There is already a C parser written in Haskell by someone in
				2076	the Haskell community, and that would probably be a good starting
				2077	point.
				2078
				2079	<p>
				2080	The second complication is how to get rid of these
				2081	<code>NOACCESS</code> records inside Valgrind when the instrumented
				2082	function exits; after all, these refer to stack addresses and will
				2083	make no sense whatever when some other function happens to re-use the
				2084	same stack address range, probably shortly afterwards. I think I
				2085	would be inclined to define a special stack-specific macro
				2086	<pre>
				2087	VALGRIND_MAKE_NOACCESS_STACK(addr, len)
				2088	</pre>
				2089	which causes Valgrind to record the client's <code>%ESP</code> at the
				2090	time it is executed. Valgrind will then watch for changes in
				2091	<code>%ESP</code> and discard such records as soon as the protected
				2092	area is uncovered by an increase in <code>%ESP</code>. I hesitate
				2093	with this scheme only because it is potentially expensive, if there
				2094	are hundreds of such records, and considering that changes in
				2095	<code>%ESP</code> already require expensive messing with stack access
				2096	permissions.
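
<p>
The bookkeeping this implies inside Valgrind might look roughly like the
C sketch below. All names and the fixed-size table are assumptions for
illustration. The sketch takes "uncovered" to mean that the start of the
protected range now lies below the new <code>%ESP</code>, relying on the
x86 stack growing downwards. The recorded <code>%ESP</code> is stored as
the text describes but is not consulted here.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_RECS 64

typedef struct {
    uintptr_t addr;   /* start of the protected range          */
    size_t    len;    /* its length in bytes                   */
    uintptr_t esp;    /* client %ESP when the record was made  */
} StackRec;

typedef struct {
    StackRec recs[MAX_RECS];
    size_t   n;
} StackRecs;

/* Record a protected range.  Returns 0 on success, -1 if the table
   is full. */
int stack_rec_add(StackRecs *t, uintptr_t addr, size_t len, uintptr_t esp)
{
    if (t->n == MAX_RECS)
        return -1;
    t->recs[t->n].addr = addr;
    t->recs[t->n].len  = len;
    t->recs[t->n].esp  = esp;
    t->n++;
    return 0;
}

/* Called when %ESP changes.  Bytes below new_esp are no longer part of
   the live stack, so drop every record whose range starts below it.
   Returns the number of records discarded. */
size_t stack_rec_on_esp_change(StackRecs *t, uintptr_t new_esp)
{
    size_t kept = 0, dropped;
    for (size_t i = 0; i < t->n; i++)
        if (t->recs[i].addr >= new_esp)
            t->recs[kept++] = t->recs[i];
    dropped = t->n - kept;
    t->n = kept;
    return dropped;
}
```

The linear scan on every <code>%ESP</code> change is the cost worried
about above; a table kept sorted by address would let the common case
stop after one comparison.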
				2097
				2098	<p>
				2099	This is probably easier and more robust than for the instrumenter
				2100	program to try and spot all exit points for the procedure and place
				2101	suitable deallocation annotations there. Plus C++ procedures can
				2102	bomb out at any point if they get an exception, so spotting return
				2103	points at the source level just won't work at all.
				2104
				2105	<p>
				2106	Although this is some work, it's all eminently doable, and it would make
				2107	Valgrind into an even-more-useful tool.
				2108
				2109
				2110	<p>
				2111
				2112	</body>
				2113	</html>