Blame - memcheck/docs/mc-tech-docs.xml - platform/external/valgrind

njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	1	<?xml version="1.0"?> <!-- -*- sgml -*- -->
				2	<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	3	"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
				4
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	5
				6	<chapter id="mc-tech-docs"
				7	xreflabel="The design and implementation of Valgrind">
				8
				9	<title>The Design and Implementation of Valgrind</title>
				10	<subtitle>Detailed technical notes for hackers, maintainers and
				11	the overly-curious</subtitle>
				12
				13	<sect1 id="mc-tech-docs.intro" xreflabel="Introduction">
				14	<title>Introduction</title>
				15
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	16	<para>This document contains a detailed, highly-technical description of
				17	the internals of Valgrind. This is not the user manual; if you are an
				18	end-user of Valgrind, you do not want to read this. Conversely, if you
				19	really are a hacker-type and want to know how it works, I assume that
				20	you have read the user manual thoroughly.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	21
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	22	<para>You may need to read this document several times, and carefully.
				23	Some important things, I only say once.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	24
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	25	<para>[Note: this document is now very old, and a lot of its contents
				26	are out of date, and misleading.]</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	27
				28
				29	<sect2 id="mc-tech-docs.history" xreflabel="History">
				30	<title>History</title>
				31
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	32	<para>Valgrind came into public view in late Feb 2002. However, it has
				33	been under contemplation for a very long time, perhaps seriously for
				34	about five years. Somewhat over two years ago, I started working on the
				35	x86 code generator for the Glasgow Haskell Compiler
				36	(http://www.haskell.org/ghc), gaining familiarity with x86 internals on
				37	the way. I then did Cacheprof, gaining further x86 experience. Some
				38	time around Feb 2000 I started experimenting with a user-space x86
				39	interpreter for x86-Linux. This worked, but it was clear that a
				40	JIT-based scheme would be necessary to give reasonable performance for
				41	Valgrind. Design work for the JITter started in earnest in Oct 2000,
				42	and by early 2001 I had an x86-to-x86 dynamic translator which could run
				43	quite large programs. This translator was in a sense pointless, since
				44	it did not do any instrumentation or checking.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	45
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	46	<para>Most of the rest of 2001 was taken up designing and implementing
				47	the instrumentation scheme. The main difficulty, which consumed a lot
				48	of effort, was to design a scheme which did not generate large numbers
				49	of false uninitialised-value warnings. By late 2001 a satisfactory
				50	scheme had been arrived at, and I started to test it on ever-larger
				51	programs, with an eventual eye to making it work well enough so that it
				52	was helpful to folks debugging the upcoming version 3 of KDE. I've used
				53	KDE since before version 1.0, and wanted Valgrind to be an indirect
				54	contribution to the KDE 3 development effort. At the start of Feb 02
				55	the kde-core-devel crew started using it, and gave a huge amount of
				56	helpful feedback and patches in the space of three weeks. Snapshot
				57	20020306 is the result.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	58
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	59	<para>In the best Unix tradition, or perhaps in the spirit of Fred
				60	Brooks' depressing-but-completely-accurate epitaph "build one to throw
				61	away; you will anyway", much of Valgrind is a second or third rendition
				62	of the initial idea. The instrumentation machinery
				63	(<filename>vg_translate.c</filename>, <filename>vg_memory.c</filename>)
				64	and core CPU simulation (<filename>vg_to_ucode.c</filename>,
				65	<filename>vg_from_ucode.c</filename>) have had three redesigns and
				66	rewrites; the register allocator, low-level memory manager
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	67	(<filename>vg_malloc2.c</filename>) and symbol table reader
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	68	(<filename>vg_symtab2.c</filename>) are on the second rewrite. In a
				69	sense, this document serves to record some of the knowledge gained as a
				70	result.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	71
				72	</sect2>
				73
				74
				75	<sect2 id="mc-tech-docs.overview" xreflabel="Design overview">
				76	<title>Design overview</title>
				77
				78	<para>Valgrind is compiled into a Linux shared object,
				79	<filename>valgrind.so</filename>, and also a dummy one,
				80	<filename>valgrinq.so</filename>, of which more later. The
				81	<filename>valgrind</filename> shell script adds
				82	<filename>valgrind.so</filename> to the
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	83	<computeroutput>LD_PRELOAD</computeroutput> list of extra libraries to
				84	be loaded with any dynamically linked executable. This is a standard
				85	trick, one which I assume the
				86	<computeroutput>LD_PRELOAD</computeroutput> mechanism was developed to
				87	support.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	88
				89	<para><filename>valgrind.so</filename> is linked with the
				90	<computeroutput>-z initfirst</computeroutput> flag, which
				91	requests that its initialisation code is run before that of any
				92	other object in the executable image. When this happens,
				93	valgrind gains control. The real CPU becomes "trapped" in
				94	<filename>valgrind.so</filename> and the translations it
				95	generates. The synthetic CPU provided by Valgrind does, however,
				96	return from this initialisation function. So the normal startup
				97	actions, orchestrated by the dynamic linker
				98	<filename>ld.so</filename>, continue as usual, except on the
				99	synthetic CPU, not the real one. Eventually
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	100	<function>main</function> is run and returns, and
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	101	then the finalisation code of the shared objects is run,
				102	presumably in the reverse of the order in which they were initialised.
				103	Remember, this is still all happening on the simulated CPU.
				104	Eventually <filename>valgrind.so</filename>'s own finalisation
				105	code is called. It spots this event, shuts down the simulated
				106	CPU, prints any error summaries and/or does leak detection, and
				107	returns from the initialisation code on the real CPU. At this
				108	point, in effect the real and synthetic CPUs have merged back
				109	into one, Valgrind has lost control of the program, and the
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	110	program finally <function>exit()s</function> back to
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	111	the kernel in the usual way.</para>
				112
				113	<para>The normal course of activity, once Valgrind has started
				114	up, is as follows. Valgrind never runs any part of your program
				115	(usually referred to as the "client"), not a single byte of it,
				116	directly. Instead it uses function
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	117	<function>VG_(translate)</function> to translate
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	118	basic blocks (BBs, straight-line sequences of code) into
				119	instrumented translations, and those are run instead. The
				120	translations are stored in the translation cache (TC),
				121	<computeroutput>vg_tc</computeroutput>, with the translation
				122	table (TT), <computeroutput>vg_tt</computeroutput> supplying the
				123	original-to-translation code address mapping. Auxiliary array
				124	<computeroutput>VG_(tt_fast)</computeroutput> is used as a
				125	direct-map cache for fast lookups in TT; it usually achieves a
				126	hit rate of around 98% and facilitates an orig-to-trans lookup in
				127	4 x86 insns, which is not bad.</para>
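<para>The two-level lookup described above can be sketched as follows.
This is only an illustration under stated assumptions, not Valgrind's
actual code: the names <computeroutput>tt</computeroutput>,
<computeroutput>tt_fast</computeroutput>,
<computeroutput>tt_search</computeroutput> and
<computeroutput>lookup</computeroutput>, the table sizes, the linear
scan used as the slow path, and the choice of the low address bits as
the cache index are all hypothetical.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes; the real tables are far larger. */
#define FAST_SIZE 1024u   /* number of direct-mapped cache slots */
#define TT_SIZE   64u     /* number of translation table entries */

typedef struct {
    uint32_t orig;    /* original code address         */
    uint32_t trans;   /* address of its translation    */
    int      used;    /* nonzero if this slot is taken */
} TTEntry;

TTEntry  tt[TT_SIZE];          /* stands in for the translation table (TT) */
TTEntry *tt_fast[FAST_SIZE];   /* direct-mapped cache of pointers into TT  */
unsigned fast_hits, fast_misses;

/* Record a new orig-to-translation mapping in the first free slot. */
void tt_add(uint32_t orig, uint32_t trans) {
    for (size_t i = 0; i < TT_SIZE; i++) {
        if (!tt[i].used) {
            tt[i] = (TTEntry){ orig, trans, 1 };
            return;
        }
    }
    assert(0 && "translation table full");
}

/* Slow path: search the whole table (a linear scan in this sketch). */
TTEntry *tt_search(uint32_t orig) {
    for (size_t i = 0; i < TT_SIZE; i++) {
        if (tt[i].used && tt[i].orig == orig) return &tt[i];
    }
    return NULL;
}

/* Return the translation for orig, or 0 if none exists yet. */
uint32_t lookup(uint32_t orig) {
    unsigned slot = orig & (FAST_SIZE - 1u);   /* index by low address bits */
    TTEntry *e = tt_fast[slot];
    if (e != NULL && e->orig == orig) {        /* fast path: one compare */
        fast_hits++;
        return e->trans;
    }
    fast_misses++;
    e = tt_search(orig);
    if (e == NULL) return 0;                   /* caller must translate */
    tt_fast[slot] = e;                         /* evict whatever was there */
    return e->trans;
}
```
]]></programlisting>
<para>Two original addresses that share the same low bits compete for
one slot in this sketch, so each lookup of one evicts the other; a
high hit rate depends on such collisions being rare in practice.</para>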
				128
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	129	<para>Function <function>VG_(dispatch)</function> in
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	130	<filename>vg_dispatch.S</filename> is the heart of the JIT
				131	dispatcher. Once a translated code address has been found, it is
				132	executed simply by an x86 <computeroutput>call</computeroutput>
				133	to the translation. At the end of the translation, the next
				134	original code addr is loaded into
				135	<computeroutput>%eax</computeroutput>, and the translation then
				136	does a <computeroutput>ret</computeroutput>, taking it back to
				137	the dispatch loop, with, interestingly, zero branch
				138	mispredictions. The address requested in
				139	<computeroutput>%eax</computeroutput> is looked up first in
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	140	<function>VG_(tt_fast)</function>, and, if not found,
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	141	by calling C helper
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	142	<function>VG_(search_transtab)</function>. If there
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	143	is still no translation available,
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	144	<function>VG_(dispatch)</function> exits back to the
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	145	top-level C dispatcher
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	146	<function>VG_(toploop)</function>, which arranges for
				147	<function>VG_(translate)</function> to make a new
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	148	translation. All fairly unsurprising, really. There are various
				149	complexities described below.</para>
				150
				151	<para>The translator, orchestrated by
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	152	<function>VG_(translate)</function>, is complicated
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	153	but entirely self-contained. It is described in great detail in
				154	subsequent sections. Translations are stored in TC, with TT
				155	tracking administrative information. The translations are
				156	subject to an approximate LRU-based management scheme. With the
				157	current settings, the TC can hold at most about 15MB of
				158	translations, and LRU passes prune it to about 13.5MB. Given
				159	that the orig-to-translation expansion ratio is about 13:1 to
				160	14:1, this means TC holds translations for more or less a
				161	megabyte of original code, which generally comes to about 70000
				162	basic blocks for C++ compiled with optimisation on. Generating
				163	new translations is expensive, so it is worth having a large TC
				164	to minimise the (capacity) miss rate.</para>
				165
				166	<para>The dispatcher,
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	167	<function>VG_(dispatch)</function>, receives hints
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	168	from the translations which allow it to cheaply spot all control
				169	transfers corresponding to x86
				170	<computeroutput>call</computeroutput> and
				171	<computeroutput>ret</computeroutput> instructions. It has to do
				172	this in order to spot some special events:</para>
				173
				174	<itemizedlist>
				175	<listitem>
				176	<para>Calls to
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	177	<function>VG_(shutdown)</function>. This is
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	178	Valgrind's cue to exit. NOTE: actually this is done a
				179	different way; it should be cleaned up.</para>
				180	</listitem>
				181
				182	<listitem>
				183	<para>Returns of signal handlers, to the return address
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	184	<function>VG_(signalreturn_bogusRA)</function>.
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	185	The signal simulator needs to know when a signal handler is
				186	returning, so we spot jumps (returns) to this address.</para>
				187	</listitem>
				188
				189	<listitem>
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	190	<para>Calls to <function>vg_trap_here</function>.
				191	All <function>malloc</function>,
				192	<function>free</function>, etc calls that the
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	193	client program makes are eventually routed to a call to
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	194	<function>vg_trap_here</function>, and Valgrind
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	195	does its own special thing with these calls. In effect this
				196	provides a trapdoor, by which Valgrind can intercept certain
				197	calls on the simulated CPU, run the call as it sees fit
				198	itself (on the real CPU), and return the result to the
				199	simulated CPU, quite transparently to the client
				200	program.</para>
				201	</listitem>
				202
				203	</itemizedlist>
				204
				205	<para>Valgrind intercepts the client's
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	206	<function>malloc</function>,
				207	<function>free</function>, etc, calls, so that it can
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	208	store additional information. Each block
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	209	<function>malloc</function>'d by the client gives
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	210	rise to a shadow block in which Valgrind stores the call stack at
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	211	the time of the <function>malloc</function> call.
				212	When the client calls <function>free</function>,
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	213	Valgrind tries to find the shadow block corresponding to the
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	214	address passed to <function>free</function>, and
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	215	emits an error message if none can be found. If it is found, the
				216	block is placed on the freed blocks queue
				217	<computeroutput>vg_freed_list</computeroutput>, it is marked as
				218	inaccessible, and its shadow block now records the call stack at
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	219	the time of the <function>free</function> call.
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	220	Keeping <computeroutput>free</computeroutput>'d blocks in this
				221	queue allows Valgrind to spot all (presumably invalid) accesses
				222	to them. However, once the volume of blocks in the free queue
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	223	exceeds <function>VG_(clo_freelist_vol)</function>,
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	224	blocks are finally removed from the queue.</para>
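<para>The behaviour of the freed-blocks queue can be sketched roughly
as below. This is a simplified illustration, not the real
implementation: the <computeroutput>Shadow</computeroutput> struct,
the function <computeroutput>freed_queue_add</computeroutput> and the
variable <computeroutput>freelist_max</computeroutput> (standing in
for <function>VG_(clo_freelist_vol)</function>, with an arbitrary
value) are hypothetical, and a real shadow block would also carry the
client address and the recorded call stack.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, stripped-down shadow block. */
typedef struct Shadow {
    struct Shadow *next;
    size_t         size;   /* size of the client block it describes */
} Shadow;

Shadow  *freed_head, *freed_tail;   /* oldest block first               */
size_t   freed_volume;              /* bytes currently held in the queue */
size_t   freelist_max = 1000;       /* assumed cap, for illustration     */
unsigned released_count;            /* blocks finally given up           */

/* Park a freed block of the given size at the tail of the queue, then
   release the oldest blocks until the volume is back under the cap. */
void freed_queue_add(size_t size) {
    Shadow *sh = malloc(sizeof *sh);
    assert(sh != NULL);
    sh->next = NULL;
    sh->size = size;
    if (freed_tail != NULL) freed_tail->next = sh;
    else                    freed_head = sh;
    freed_tail = sh;
    freed_volume += size;

    while (freed_volume > freelist_max) {
        Shadow *old = freed_head;
        assert(old != NULL);
        freed_head = old->next;
        if (freed_head == NULL) freed_tail = NULL;
        freed_volume -= old->size;
        free(old);             /* from here on the memory may be reused */
        released_count++;
    }
}
```
]]></programlisting>
<para>The trade-off is visible in the cap: a larger value keeps freed
blocks inaccessible for longer, so more stale accesses are caught, at
the cost of holding more memory out of circulation.</para>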
				225
				226	<para>Keeping track of <literal>A</literal> and
				227	<literal>V</literal> bits (note: if you don't know what these
				228	are, you haven't read the user guide carefully enough) for memory
				229	is done in <filename>vg_memory.c</filename>. This implements a
				230	sparse array structure which covers the entire 4G address space
				231	in a way which is reasonably fast and reasonably space efficient.
				232	The 4G address space is divided up into 64K sections, each
				233	covering 64KB of address space. Given a 32-bit address, the top
				234	16 bits are used to select one of the 65536 entries in
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	235	<function>VG_(primary_map)</function>. The resulting
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	236	"secondary" (<computeroutput>SecMap</computeroutput>) holds A and
				237	V bits for the 64KB chunk of address space corresponding to the
				238	lower 16 bits of the address.</para>
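<para>A minimal sketch of such a two-level sparse map follows. It is
an illustration only: the layout of
<computeroutput>SecMap</computeroutput>, the helper names, the lazy
allocation of secondaries, and the convention that a set A bit means
"addressable" are assumptions made here, not a description of what
<filename>vg_memory.c</filename> actually does.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SEC_BYTES 65536u            /* address space covered by one secondary */

typedef struct {
    uint8_t abits[SEC_BYTES / 8];   /* one A bit per client byte    */
    uint8_t vbyte[SEC_BYTES];       /* eight V bits per client byte */
} SecMap;

/* One slot per 64KB chunk of the 4G space; NULL means nothing recorded. */
SecMap *primary_map[65536];

/* Find the secondary for address a, allocating it on first use if asked. */
SecMap *secmap_for(uint32_t a, int create) {
    uint32_t hi = a >> 16;                   /* top 16 bits pick the entry */
    if (primary_map[hi] == NULL && create) {
        primary_map[hi] = calloc(1, sizeof(SecMap));
        assert(primary_map[hi] != NULL);
    }
    return primary_map[hi];
}

void set_abit(uint32_t a, int addressable) {
    SecMap  *sm   = secmap_for(a, 1);
    uint32_t lo   = a & 0xFFFFu;             /* low 16 bits index the chunk */
    uint8_t  mask = (uint8_t)(1u << (lo & 7u));
    if (addressable) sm->abits[lo >> 3] |= mask;
    else             sm->abits[lo >> 3] &= (uint8_t)~mask;
}

int get_abit(uint32_t a) {
    SecMap  *sm = secmap_for(a, 0);
    uint32_t lo = a & 0xFFFFu;
    if (sm == NULL) return 0;                /* untouched chunk: not addressable */
    return (sm->abits[lo >> 3] >> (lo & 7u)) & 1;
}

void set_vbyte(uint32_t a, uint8_t v) {
    secmap_for(a, 1)->vbyte[a & 0xFFFFu] = v;
}

uint8_t get_vbyte(uint32_t a) {
    SecMap *sm = secmap_for(a, 0);
    return sm != NULL ? sm->vbyte[a & 0xFFFFu] : 0;
}
```
]]></programlisting>
<para>Only chunks that are actually touched get a secondary, which is
what keeps the structure sparse; the cost of a lookup is one shift,
one mask and two array indexings.</para>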
				239
				240	</sect2>
				241
				242
				243
				244	<sect2 id="mc-tech-docs.design" xreflabel="Design decisions">
				245	<title>Design decisions</title>
				246
				247	<para>Some design decisions were motivated by the need to make
				248	Valgrind debuggable. Imagine you are writing a CPU simulator.
				249	It works fairly well. However, you run some large program, like
				250	Netscape, and after tens of millions of instructions, it crashes.
				251	How can you figure out where in your simulator the bug is?</para>
				252
				253	<para>Valgrind's answer is: cheat. Valgrind is designed so that
				254	it is possible to switch back to running the client program on
				255	the real CPU at any point. Using the
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	256	<option>--stop-after= </option> flag, you can ask
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	257	Valgrind to run just some number of basic blocks, and then run
				258	the rest of the way on the real CPU. If you are searching for a
				259	bug in the simulated CPU, you can use this to do a binary search,
				260	which quickly leads you to the specific basic block which is
				261	causing the problem.</para>
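<para>The binary search over basic block counts might be driven by
something like the sketch below. It assumes a hypothetical oracle,
<computeroutput>run_ok</computeroutput>, which runs the client with a
given stop-after count and reports whether it behaved correctly, and
it assumes the failure is deterministic and monotonic (every count at
or beyond the faulty block fails). Neither the oracle nor
<computeroutput>first_bad_block</computeroutput> exists in
Valgrind.</para>
<programlisting><![CDATA[
```c
#include <assert.h>

/* Hypothetical oracle: run the first stop_after basic blocks on the
   simulated CPU and the rest on the real one; nonzero means the run
   behaved correctly. */
typedef int (*RunOk)(unsigned long stop_after);

/* Precondition: run_ok(lo) succeeds and run_ok(hi) fails.
   Returns the smallest count that fails, i.e. the number of the basic
   block whose simulation goes wrong. */
unsigned long first_bad_block(RunOk run_ok, unsigned long lo, unsigned long hi) {
    assert(lo < hi);
    while (hi - lo > 1) {
        unsigned long mid = lo + (hi - lo) / 2;
        if (run_ok(mid)) lo = mid;   /* fault lies beyond mid        */
        else             hi = mid;   /* fault is at or before mid    */
    }
    return hi;
}

/* Stand-in oracle for demonstration: pretend block 123457 is the one
   that is mistranslated. */
unsigned demo_runs;
int demo_run_ok(unsigned long stop_after) {
    demo_runs++;
    return stop_after < 123457UL;
}
```
]]></programlisting>
<para>With tens of millions of basic blocks the search needs only a
couple of dozen runs, which is why the technique is practical even for
large programs.</para>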
				262
				263	<para>This is all very handy. It does constrain the design in
				264	certain unimportant ways. Firstly, the layout of memory, when
				265	viewed from the client's point of view, must be identical
				266	regardless of whether it is running on the real or simulated CPU.
				267	This means that Valgrind can't do pointer swizzling -- well, no
				268	great loss -- and it can't run on the same stack as the client --
				269	again, no great loss. Valgrind operates on its own stack,
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	270	<function>VG_(stack)</function>, which it switches to
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	271	at startup, temporarily switching back to the client's stack when
				272	doing system calls for the client.</para>
				273
				274	<para>Valgrind also receives signals on its own stack,
				275	<computeroutput>VG_(sigstack)</computeroutput>, but for different
				276	gruesome reasons discussed below.</para>
				277
				278	<para>This nice clean
				279	switch-back-to-the-real-CPU-whenever-you-like story is muddied by
				280	signals. The problem is that signals arrive at arbitrary times and
				281	tend to slightly perturb the basic block count, with the result
				282	that you can get close to the basic block causing a problem but
				283	can't home in on it exactly. My kludgey hack is to define
				284	<computeroutput>SIGNAL_SIMULATION</computeroutput> to 1 towards
				285	the bottom of <filename>vg_syscall_mem.c</filename>, so that
				286	signal handlers are run on the real CPU and don't change the BB
				287	counts.</para>
				288
				289	<para>A second hole in the switch-back-to-real-CPU story is that
				290	Valgrind's way of delivering signals to the client is different
				291	from that of the kernel. Specifically, the layout of the signal
				292	delivery frame, and the mechanism used to detect a sighandler
				293	returning, are different. So you can't expect to make the
				294	transition inside a sighandler and still have things working, but
				295	in practice that's not much of a restriction.</para>
				296
				297	<para>Valgrind's implementation of
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	298	<function>malloc</function>,
				299	<function>free</function>, etc, (in
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	300	<filename>vg_clientmalloc.c</filename>, not the low-level stuff
				301	in <filename>vg_malloc2.c</filename>) is somewhat complicated by
				302	the need to handle switching back at arbitrary points. It does
				303	work, though.</para>
				304
				305	</sect2>
				306
				307
				308
				309	<sect2 id="mc-tech-docs.correctness" xreflabel="Correctness">
				310	<title>Correctness</title>
				311
				312	<para>There's only one of me, and I have a Real Life (tm) as well
				313	as hacking Valgrind [allegedly :-]. That means I don't have time
				314	to waste chasing endless bugs in Valgrind. My emphasis is
				315	therefore on doing everything as simply as possible, with
				316	correctness, stability and robustness being the number one
				317	priority, more important than performance or functionality. As a
				318	result:</para>
				319
				320	<itemizedlist>
				321
				322	<listitem>
				323	<para>The code is absolutely loaded with assertions, and
				324	these are <command>permanently enabled.</command> I have no
				325	plan to remove or disable them later. Over the past couple
				326	of months, as valgrind has become more widely used, they have
				327	shown their worth, pulling up various bugs which would
				328	otherwise have appeared as hard-to-find segmentation
				329	faults.</para>
				330
				331	<para>I am of the view that it's acceptable to spend 5% of
				332	the total running time of your valgrindified program doing
				333	assertion checks and other internal sanity checks.</para>
				334	</listitem>
				335
				336	<listitem>
				337	<para>Aside from the assertions, valgrind contains various
				338	sets of internal sanity checks, which get run at varying
				339	frequencies during normal operation.
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	340	<function>VG_(do_sanity_checks)</function> runs
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	341	every 1000 basic blocks, which means 500 to 2000 times/second
				342	for typical machines at present. It checks that Valgrind
				343	hasn't overrun its private stack, and does some simple checks
				344	on the memory permissions maps. Once every 25 calls it does
				345	some more extensive checks on those maps. Etc, etc.</para>
				346	<para>The following components also have sanity check code,
				347	which can be enabled to aid debugging:</para>
				348	<itemizedlist>
				349	<listitem><para>The low-level memory-manager
				350	(<computeroutput>VG_(mallocSanityCheckArena)</computeroutput>).
				351	This does a complete check of all blocks and chains in an
				352	arena, which is very slow. It is not engaged by default.</para>
				353	</listitem>
				354
				355	<listitem>
				356	<para>The symbol table reader(s): various checks to
				357	ensure uniqueness of mappings; see
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	358	<function>VG_(read_symbols)</function> for a
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	359	start. This is permanently engaged.</para>
				360	</listitem>
				361
				362	<listitem>
				363	<para>The A and V bit tracking stuff in
				364	<filename>vg_memory.c</filename>. This can be compiled
				365	with cpp symbol
				366	<computeroutput>VG_DEBUG_MEMORY</computeroutput> defined,
				367	which removes all the fast, optimised cases, and uses
				368	simple-but-slow fallbacks instead. Not engaged by
				369	default.</para>
				370	</listitem>
				371
				372	<listitem>
				373	<para>Ditto
				374	<computeroutput>VG_DEBUG_LEAKCHECK</computeroutput>.</para>
				375	</listitem>
				376
				377	<listitem>
				378	<para>The JITter parses x86 basic blocks into sequences
				379	of UCode instructions. It then sanity checks each one
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	380	with <function>VG_(saneUInstr)</function> and
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	381	sanity checks the sequence as a whole with
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	382	<function>VG_(saneUCodeBlock)</function>.
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	383	This stuff is engaged by default, and has caught some
				384	way-obscure bugs in the simulated CPU machinery in its
				385	time.</para>
				386	</listitem>
				387
				388	<listitem>
				389	<para>The system call wrapper does
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	390	<function>VG_(first_and_last_secondaries_look_plausible)</function>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	391	after every syscall; this is known to pick up bugs in the
				392	syscall wrappers. Engaged by default.</para>
				393	</listitem>
				394
				395	<listitem>
				396	<para>The main dispatch loop, in
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	397	<function>VG_(dispatch)</function>, checks
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	398	that translations do not set
				399	<computeroutput>%ebp</computeroutput> to any value
				400	different from
				401	<computeroutput>VG_EBP_DISPATCH_CHECKED</computeroutput>
				402	or <computeroutput>&amp; VG_(baseBlock)</computeroutput>.
				403	In effect this test is free, and is permanently
				404	engaged.</para>
				405	</listitem>
				406
				407	<listitem>
				408	<para>There are a couple of ifdefed-out consistency
				409	checks I inserted whilst debugging the new register
				410	allocator,
				411	<computeroutput>vg_do_register_allocation</computeroutput>.</para>
				412	</listitem>
				413	</itemizedlist>
				414	</listitem>
				415
				416	<listitem>
				417	<para>I try to avoid techniques, algorithms, mechanisms, etc,
				418	for which I can supply neither a convincing argument that
				419	they are correct, nor sanity-check code which might pick up
				420	bugs in my implementation. I don't always succeed in this,
				421	but I try. Basically the idea is: avoid techniques which
				422	are, in practice, unverifiable, in some sense. When doing
				423	anything, always have in mind: "how can I verify that this is
				424	correct?"</para>
				425	</listitem>
				426
				427	</itemizedlist>
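<para>The scheduling of the periodic sanity checks mentioned in the
list above (a cheap pass every 1000 basic blocks, a more extensive
pass on every 25th call) amounts to a pair of counters. The sketch
below shows only that scheduling; the function and variable names are
hypothetical and the check bodies are placeholders.</para>
<programlisting><![CDATA[
```c
#include <assert.h>

#define BBS_PER_CHECK   1000   /* cheap checks every 1000 basic blocks  */
#define CHECKS_PER_FULL 25     /* extensive checks on every 25th call   */

unsigned long bbs_done;        /* basic blocks executed so far */
unsigned long cheap_checks;    /* cheap passes performed       */
unsigned long full_checks;     /* extensive passes performed   */

/* Placeholder: the real checks would inspect the private stack and
   the memory permission maps. */
void do_sanity_checks(void) {
    cheap_checks++;
    if (cheap_checks % CHECKS_PER_FULL == 0) {
        full_checks++;         /* the more extensive pass goes here */
    }
    assert(full_checks <= cheap_checks);
}

/* Called once after each basic block has run. */
void basic_block_done(void) {
    bbs_done++;
    if (bbs_done % BBS_PER_CHECK == 0) {
        do_sanity_checks();
    }
}
```
]]></programlisting>
<para>Tying the checks to the basic block count rather than to wall
clock time keeps their cost proportional to the work done and makes
runs reproducible.</para>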
				428
				429
				430	<para>Some more specific things are:</para>
				431	<itemizedlist>
				432	<listitem>
				433	<para>Valgrind runs in the same namespace as the client, at
				434	least from <filename>ld.so</filename>'s point of view, and it
				435	therefore absolutely had better not export any symbol with a
				436	name which could clash with that of the client or any of its
				437	libraries. Therefore, all globally visible symbols exported
				438	from <filename>valgrind.so</filename> are defined using the
				439	<computeroutput>VG_</computeroutput> CPP macro. As you'll
				440	see from <filename>vg_constants.h</filename>, this prepends
				441	some arbitrary prefix to the symbol, in order that it be, we
				442	hope, globally unique. Currently the prefix is
				443	<computeroutput>vgPlain_</computeroutput>. For convenience
				444	there are also <computeroutput>VGM_</computeroutput>,
				445	<computeroutput>VGP_</computeroutput> and
				446	<computeroutput>VGOFF_</computeroutput>. All locally defined
				447	symbols are declared <computeroutput>static</computeroutput>
				448	and do not appear in the final shared object.</para>
				449
				450	<para>To check this, I periodically do <computeroutput>nm
				451	valgrind.so | grep " T "</computeroutput>, which shows you
				452	all the globally exported text symbols. They should all have
				453	an approved prefix, except for those like
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	454	<function>malloc</function>,
				455	<function>free</function>, etc, which we
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	456	deliberately want to shadow and take precedence over the same
				457	names exported from <filename>glibc.so</filename>, so that
				458	valgrind can intercept those calls easily. Similarly,
				459	<computeroutput>nm valgrind.so | grep " D "</computeroutput>
				460	allows you to find any rogue data-segment symbol
				461	names.</para>
				462	</listitem>
				463
				464	<listitem>
				465	<para>Valgrind tries, and almost succeeds, in being
				466	completely independent of all other shared objects, in
				467	particular of <filename>glibc.so</filename>. For example, we
				468	have our own low-level memory manager in
				469	<filename>vg_malloc2.c</filename>, which is a fairly standard
				470	malloc/free scheme augmented with arenas, and
				471	<filename>vg_mylibc.c</filename> exports reimplementations of
				472	various bits and pieces you'd normally get from the C
				473	library.</para>
				474
				475	<para>Why all the hassle? Because imagine the potential
				476	chaos of both the simulated and real CPUs executing in
				477	<filename>glibc.so</filename>. It just seems simpler and
				478	cleaner to be completely self-contained, so that only the
				479	simulated CPU visits <filename>glibc.so</filename>. In
				480	practice it's not much hassle anyway. Also, valgrind starts
				481	up before glibc has a chance to initialise itself, and who
				482	knows what difficulties that could lead to. Finally, glibc
				483	has definitions for some types, specifically
				484	<computeroutput>sigset_t</computeroutput>, which conflict with
				485	(are different from) the Linux kernel's idea of same. When
				486	Valgrind wants to fiddle around with signal stuff, it wants
				487	to use the kernel's definitions, not glibc's definitions. So
				488	it's simplest just to keep glibc out of the picture
				489	entirely.</para>
				490
				491	<para>To find out which glibc symbols are used by Valgrind,
				492	reinstate the link flags <computeroutput>-nostdlib
				493	-Wl,-no-undefined</computeroutput>. This causes linking to
				494	fail, but will tell you what you depend on. I have mostly,
				495	but not entirely, got rid of the glibc dependencies; what
				496	remains is, IMO, fairly harmless. AFAIK the current
				497	dependencies are: <computeroutput>memset</computeroutput>,
				498	<computeroutput>memcmp</computeroutput>,
				499	<computeroutput>stat</computeroutput>,
				500	<computeroutput>system</computeroutput>,
				501	<computeroutput>sbrk</computeroutput>,
				502	<computeroutput>setjmp</computeroutput> and
				503	<computeroutput>longjmp</computeroutput>.</para>
				504	</listitem>
				505
				506	<listitem>
				507	<para>Similarly, valgrind should not really import any
				508	headers other than the Linux kernel headers, since it knows
				509	of no API other than the kernel interface to talk to. At the
				510	moment this is really not in a good state, and
				511	<filename>vg_syscall_mem.c</filename> imports, via
				512	<filename>vg_unsafe.h</filename>, a significant number of
				513	C-library headers so as to know the sizes of various structs
				514	passed across the kernel boundary. This is of course
				515	completely bogus, since there is no guarantee that the C
				516	library's definitions of these structs match those of the
				517	kernel. I have started to sort this out using
				518	<filename>vg_kerneliface.h</filename>, into which I had
				519	intended to copy all kernel definitions which valgrind could
				520	need, but this has not gotten very far. At the moment it
				521	mostly contains definitions for
				522	<computeroutput>sigset_t</computeroutput> and
				523	<computeroutput>struct sigaction</computeroutput>, since the
				524	kernel's definition for these really does clash with glibc's.
				525	I plan to use a <computeroutput>vki_</computeroutput> prefix
				526	on all these types and constants, to denote the fact that
				527	they pertain to <command>V</command>algrind's
				528	<command>K</command>ernel
				529	<command>I</command>nterface.</para>
				530
				531	<para>Another advantage of having a
				532	<filename>vg_kerneliface.h</filename> file is that it makes
				533	it simpler to interface to a different kernel. One can, for
				534	example, easily imagine writing a new
				535	<filename>vg_kerneliface.h</filename> for FreeBSD, or x86
				536	NetBSD.</para>
				537	</listitem>
				538
				539	</itemizedlist>
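<para>The symbol-prefixing scheme from the first item in the list
above can be illustrated with a token-pasting macro. This is a sketch
of the idea only; the definition in
<filename>vg_constants.h</filename> may well be written differently,
and <computeroutput>add</computeroutput>,
<computeroutput>STR</computeroutput> and
<computeroutput>demo_symbol_name</computeroutput> are hypothetical
names used for demonstration.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <string.h>

/* Paste a fixed prefix onto a symbol name, so that VG_(foo) names the
   global symbol vgPlain_foo. */
#define VG_(name) vgPlain_##name

/* Two-step stringification, so that the argument is macro-expanded
   before it is turned into a string. */
#define STR2(x) #x
#define STR(x)  STR2(x)

/* A globally visible function, defined through the macro. */
int VG_(add)(int a, int b) {
    return a + b;
}

/* Report the name the linker will actually see for VG_(add). */
const char *demo_symbol_name(void) {
    return STR(VG_(add));
}
```
]]></programlisting>
<para>Because every exported name goes through one macro, changing the
prefix is a one-line edit, and a listing of exported symbols makes any
name that bypassed the macro easy to spot.</para>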
				540
				541	</sect2>
				542
				543
				544
				545	<sect2 id="mc-tech-docs.limits" xreflabel="Current limitations">
				546	<title>Current limitations</title>
				547
				548	<para>Support for weird (non-POSIX) signal stuff is patchy. Does
				549	anybody care?</para>
				550
				551	</sect2>
				552
				553	</sect1>
				554
				555
				556
				557
				558
				559	<sect1 id="mc-tech-docs.jitter" xreflabel="The instrumenting JITter">
				560	<title>The instrumenting JITter</title>
				561
				562	<para>This really is the heart of the matter. We begin with
				563	various side issues.</para>
				564
				565
				566	<sect2 id="mc-tech-docs.storage"
				567	xreflabel="Run-time storage, and the use of host registers">
				568	<title>Run-time storage, and the use of host registers</title>
				569
				570	<para>Valgrind translates client (original) basic blocks into
				571	instrumented basic blocks, which live in the translation cache
				572	TC, until either the client finishes or the translations are
				573	ejected from TC to make room for newer ones.</para>
				574
				575	<para>Since it generates x86 code in memory, Valgrind has
				576	complete control of the use of registers in the translations.
				577	Now pay attention. I shall say this only once, and it is
				578	important you understand this. In what follows I will refer to
				579	registers in the host (real) cpu using their standard names,
				580	<computeroutput>%eax</computeroutput>,
				581	<computeroutput>%edi</computeroutput>, etc. I refer to registers
				582	in the simulated CPU by capitalising them:
				583	<computeroutput>%EAX</computeroutput>,
				584	<computeroutput>%EDI</computeroutput>, etc. These two sets of
				585	registers usually bear no direct relationship to each other;
				586	there is no fixed mapping between them. This naming scheme is
				587	used fairly consistently in the comments in the sources.</para>
				588
				589	<para>Host registers, once things are up and running, are used as
				590	follows:</para>
				591
				592	<itemizedlist>
				593	<listitem>
				594	<para><computeroutput>%esp</computeroutput>, the real stack
				595	pointer, points somewhere in Valgrind's private stack area,
				596	<computeroutput>VG_(stack)</computeroutput> or, transiently,
				597	into its signal delivery stack,
				598	<computeroutput>VG_(sigstack)</computeroutput>.</para>
				599	</listitem>
				600
				601	<listitem>
				602	<para><computeroutput>%edi</computeroutput> is used as a
				603	temporary in code generation; it is almost always dead,
				604	except when used for the
				605	<computeroutput>Left</computeroutput> value-tag operations.</para>
				606	</listitem>
				607
				608	<listitem>
				609	<para><computeroutput>%eax</computeroutput>,
				610	<computeroutput>%ebx</computeroutput>,
				611	<computeroutput>%ecx</computeroutput>,
				612	<computeroutput>%edx</computeroutput> and
				613	<computeroutput>%esi</computeroutput> are available to
				614	Valgrind's register allocator. They are dead (carry
				615	unimportant values) in between translations, and are live
				616	only in translations. The one exception to this is
				617	<computeroutput>%eax</computeroutput>, which, as mentioned
				618	far above, has a special significance to the dispatch loop
				619	<computeroutput>VG_(dispatch)</computeroutput>: when a
				620	translation returns to the dispatch loop,
				621	<computeroutput>%eax</computeroutput> is expected to contain
				622	the original-code-address of the next translation to run.
				623	The register allocator is so good at minimising spill code
				624	that using five regs and not having to save/restore
				625	<computeroutput>%edi</computeroutput> actually gives better
				626	code than allocating to <computeroutput>%edi</computeroutput>
				627	as well, but then having to push/pop it around special
				628	uses.</para>
				629	</listitem>
				630
				631	<listitem>
				632	<para><computeroutput>%ebp</computeroutput> points
				633	permanently at
				634	<computeroutput>VG_(baseBlock)</computeroutput>. Valgrind's
				635	translations are position-independent, partly because this is
				636	convenient, but also because translations get moved around in
				637	TC as part of the LRUing activity. <command>All</command>
				638	static entities which need to be referred to from generated
				639	code, whether data or helper functions, are stored starting
				640	at <computeroutput>VG_(baseBlock)</computeroutput> and are
				641	therefore reached by indexing from
				642	<computeroutput>%ebp</computeroutput>. There is but one
				643	exception, which is that by placing the value
				644	<computeroutput>VG_EBP_DISPATCH_CHECKED</computeroutput> in
				645	<computeroutput>%ebp</computeroutput> just before a return to
				646	the dispatcher, the dispatcher is informed that the next
				647	address to run, in <computeroutput>%eax</computeroutput>,
				648	requires special treatment.</para>
				649	</listitem>
				650
				651	<listitem>
				652	<para>The real machine's FPU state is pretty much
				653	unimportant, for reasons which will become obvious. Ditto
				654	its <computeroutput>%eflags</computeroutput> register.</para>
				655	</listitem>
				656
				657	</itemizedlist>
				658
				659	<para>The state of the simulated CPU is stored in memory, in
				660	<computeroutput>VG_(baseBlock)</computeroutput>, which is a block
				661	of 200 words IIRC. Recall that
				662	<computeroutput>%ebp</computeroutput> points permanently at the
				663	start of this block. Function
				664	<computeroutput>vg_init_baseBlock</computeroutput> decides what
				665	the offsets of various entities in
				666	<computeroutput>VG_(baseBlock)</computeroutput> are to be, and
				667	allocates word offsets for them. The code generator then emits
				668	<computeroutput>%ebp</computeroutput> relative addresses to get
				669	at those things. The sequence in which entities are allocated
				670	has been carefully chosen so that the 32 most popular entities
				671	come first, because this means 8-bit offsets can be used in the
				672	generated code.</para>
				673
				674	<para>If I was clever, I could make
				675	<computeroutput>%ebp</computeroutput> point 32 words along
				676	<computeroutput>VG_(baseBlock)</computeroutput>, so that I'd have
				677	another 32 words of short-form offsets available, but that's just
				678	complicated, and it's not important -- the first 32 words take
				679	99% (or whatever) of the traffic.</para>
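
<para>The arithmetic behind the 32-word figure can be sketched as
follows. An x86 base-plus-displacement operand carries its
displacement either as a signed 8-bit value or as a 32-bit value, so
byte offsets up to 127 get the short form, and with 4-byte words that
covers word offsets 0 to 31. The function names below are hypothetical
and do not come from the Valgrind sources; the
<computeroutput>bias_words</computeroutput> parameter models the idea
of pointing <computeroutput>%ebp</computeroutput> some way into the
block.</para>
<programlisting><![CDATA[
```c
#include <assert.h>

/* Size in bytes of the displacement needed to reach word `word_off` of a
   block when the base register points `bias_words` words into the block.
   x86 encodes a displacement as a signed 8-bit or a 32-bit value. */
int disp_bytes(int word_off, int bias_words)
{
    assert(word_off >= 0 && bias_words >= 0);
    int byte_off = (word_off - bias_words) * 4;   /* 4-byte words */
    return (byte_off >= -128 && byte_off <= 127) ? 1 : 4;
}

/* Count how many of the first `n_words` words of the block can be
   reached with the short (8-bit) form for a given bias. */
int short_form_words(int n_words, int bias_words)
{
    int count = 0;
    for (int w = 0; w < n_words; w++)
        if (disp_bytes(w, bias_words) == 1)
            count++;
    return count;
}
```
]]></programlisting>
<para>With no bias, word 31 is the last one reachable with a one-byte
displacement. Biasing the base pointer by 32 words makes the negative
half of the signed range usable too, which is where the extra 32
short-form words would come from.</para>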
				680
				681	<para>Currently, the sequence of stuff in
				682	<computeroutput>VG_(baseBlock)</computeroutput> is as
				683	follows:</para>
				684
				685	<itemizedlist>
				686	<listitem>
				687	<para>9 words, holding the simulated integer registers,
				688	<computeroutput>%EAX</computeroutput>
				689	.. <computeroutput>%EDI</computeroutput>, and the simulated
				690	flags, <computeroutput>%EFLAGS</computeroutput>.</para>
				691	</listitem>
				692
				693	<listitem>
				694	<para>Another 9 words, holding the V bit "shadows" for the
				695	above 9 regs.</para>
				696	</listitem>
				697
				698	<listitem>
				699	<para>The <command>addresses</command> of various helper
				700	routines called from generated code:
				701	<computeroutput>VG_(helper_value_check4_fail)</computeroutput>,
				702	<computeroutput>VG_(helper_value_check0_fail)</computeroutput>,
				703	which register V-check failures,
				704	<computeroutput>VG_(helperc_STOREV4)</computeroutput>,
				705	<computeroutput>VG_(helperc_STOREV1)</computeroutput>,
				706	<computeroutput>VG_(helperc_LOADV4)</computeroutput>,
				707	<computeroutput>VG_(helperc_LOADV1)</computeroutput>, which
				708	do stores and loads of V bits to/from the sparse array which
				709	keeps track of V bits in memory, and
				710	<computeroutput>VGM_(handle_esp_assignment)</computeroutput>,
sewardj	3387889	2007-11-17 09:43:25 +0000	[diff] [blame^]	711	which messes with memory addressability resulting from
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	712	changes in <computeroutput>%ESP</computeroutput>.</para>
				713	</listitem>
				714
				715	<listitem>
				716	<para>The simulated <computeroutput>%EIP</computeroutput>.</para>
				717	</listitem>
				718
				719	<listitem>
				720	<para>24 spill words, for when the register allocator can't
				721	make it work with 5 measly registers.</para>
				722	</listitem>
				723
				724	<listitem>
				725	<para>Addresses of helpers
				726	<computeroutput>VG_(helperc_STOREV2)</computeroutput>,
				727	<computeroutput>VG_(helperc_LOADV2)</computeroutput>. These
				728	are here because 2-byte loads and stores are relatively rare,
				729	so are placed above the magic 32-word offset boundary.</para>
				730	</listitem>
				731
				732	<listitem>
				733	<para>For similar reasons, addresses of helper functions
				734	<computeroutput>VGM_(fpu_write_check)</computeroutput> and
				735	<computeroutput>VGM_(fpu_read_check)</computeroutput>, which
				736	handle the A/V maps testing and changes required by FPU
				737	writes/reads.</para>
				738	</listitem>
				739
				740	<listitem>
				741	<para>Some other boring helper addresses:
				742	<computeroutput>VG_(helper_value_check2_fail)</computeroutput>
				743	and
				744	<computeroutput>VG_(helper_value_check1_fail)</computeroutput>.
				745	These are probably never emitted now, and should be
				746	removed.</para>
				747	</listitem>
				748
				749	<listitem>
				750	<para>The entire state of the simulated FPU, which I believe
				751	to be 108 bytes long.</para>
				752	</listitem>
				753
				754	<listitem>
				755	<para>Finally, the addresses of various other helper
				756	functions in <filename>vg_helpers.S</filename>, which deal
				757	with rare situations which are tedious or difficult to
				758	generate code in-line for.</para>
				759	</listitem>
				760
				761	</itemizedlist>
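
<para>A minimal sketch of how word offsets could be handed out for
the first few entries of this list, assuming each entry takes exactly
the number of words stated above and that the first group of helper
addresses is the seven routines named there. The allocator and the
<computeroutput>Layout</computeroutput> type are hypothetical; this is
not the code of
<computeroutput>vg_init_baseBlock</computeroutput>.</para>
<programlisting><![CDATA[
```c
#include <assert.h>

#define BASEBLOCK_WORDS 200   /* block size quoted in the text ("IIRC") */

static int next_free = 0;     /* next unallocated word offset */

/* Reserve `n` consecutive words and return the offset of the first. */
int alloc_words(int n)
{
    assert(n > 0 && next_free + n <= BASEBLOCK_WORDS);
    int off = next_free;
    next_free += n;
    return off;
}

/* Word offsets of the entries at the front of the block. */
typedef struct {
    int int_regs;      /* 9 words: %EAX .. %EDI and %EFLAGS   */
    int shadow_regs;   /* 9 words: V-bit shadows of the above */
    int helpers;       /* 7 words: first group of helper addresses */
    int eip;           /* 1 word: simulated %EIP              */
    int spills;        /* 24 spill words                      */
} Layout;

/* Allocate the front of the block in the order given in the list. */
Layout layout_front(void)
{
    Layout l;
    next_free = 0;
    l.int_regs    = alloc_words(9);
    l.shadow_regs = alloc_words(9);
    l.helpers     = alloc_words(7);
    l.eip         = alloc_words(1);
    l.spills      = alloc_words(24);
    return l;
}
```
]]></programlisting>
<para>Taken literally, this ordering puts the spill area at word 26,
so only its first six words would lie below word 32. The real layout
may well differ in detail.</para>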
				762
				763	<para>As a general rule, the simulated machine's state lives
				764	permanently in memory at
				765	<computeroutput>VG_(baseBlock)</computeroutput>. However, the
				766	JITter does some optimisations which allow the simulated integer
				767	registers to be cached in real registers over multiple simulated
				768	instructions within the same basic block. These are always
				769	flushed back into memory at the end of every basic block, so that
				770	the in-memory state is up-to-date between basic blocks. (This
				771	flushing is implied by the statement above that the real
				772	machine's allocatable registers are dead in between simulated
				773	blocks).</para>
				774
				775	</sect2>
				776
				777
				778
				779	<sect2 id="mc-tech-docs.startup"
				780	xreflabel="Startup, shutdown, and system calls">
				781	<title>Startup, shutdown, and system calls</title>
				782
				783	<para>Getting into Valgrind
				784	(<computeroutput>VG_(startup)</computeroutput>, called from
				785	<filename>valgrind.so</filename>'s initialisation section),
				786	really means copying the real CPU's state into
				787	<computeroutput>VG_(baseBlock)</computeroutput>, and then
				788	installing our own stack pointer, etc, into the real CPU, and
				789	then starting up the JITter. Exiting valgrind involves copying
				790	the simulated state back to the real state.</para>
				791
				792	<para>Unfortunately, there's a complication at startup time.
				793	Problem is that at the point where we need to take a snapshot of
				794	the real CPU's state, the offsets in
				795	<computeroutput>VG_(baseBlock)</computeroutput> are not set up
				796	yet, because to do so would involve disrupting the real machine's
				797	state significantly. The way round this is to dump the real
				798	machine's state into a temporary, static block of memory,
				799	<computeroutput>VG_(m_state_static)</computeroutput>. We can
				800	then set up the <computeroutput>VG_(baseBlock)</computeroutput>
				801	offsets at our leisure, and copy into it from
				802	<computeroutput>VG_(m_state_static)</computeroutput> at some
				803	convenient later time. This copying is done by
				804	<computeroutput>VG_(copy_m_state_static_to_baseBlock)</computeroutput>.</para>
				805
				806	<para>On exit, the inverse transformation is (rather
				807	unnecessarily) used: stuff in
				808	<computeroutput>VG_(baseBlock)</computeroutput> is copied to
				809	<computeroutput>VG_(m_state_static)</computeroutput>, and the
				810	assembly stub then copies from
				811	<computeroutput>VG_(m_state_static)</computeroutput> into the
				812	real machine registers.</para>
				813
				814	<para>Doing system calls on behalf of the client
				815	(<filename>vg_syscall.S</filename>) is something of a half-way
				816	house. We have to make the world look sufficiently like that
				817	which the client would normally have to make the syscall actually
				818	work properly, but we can't afford to lose control. So the trick
				819	is to copy all of the client's state, <command>except its program
				820	counter</command>, into the real CPU, do the system call, and
				821	copy the state back out. Note that the client's state includes
				822	its stack pointer register, so one effect of this partial
				823	restoration is to cause the system call to be run on the client's
				824	stack, as it should be.</para>
				825
				826	<para>As ever there are complications. We have to save some of
				827	our own state somewhere when restoring the client's state into
				828	the CPU, so that we can keep going sensibly afterwards. In fact
				829	the only thing which is important is our own stack pointer, but
				830	for paranoia reasons I save and restore our own FPU state as
				831	well, even though that's probably pointless.</para>
				832
				833	<para>The complication on the above complication is that, for
				834	horrible reasons to do with signals, we may have to handle a
				835	second client system call whilst the client is blocked inside
				836	some other system call (unbelievable!). That means there are two
				837	sets of places to dump Valgrind's stack pointer and FPU state
				838	across the syscall, and we decide which to use by consulting
				839	<computeroutput>VG_(syscall_depth)</computeroutput>, which is in
				840	turn maintained by
				841	<computeroutput>VG_(wrap_syscall)</computeroutput>.</para>
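
<para>The two-level save scheme can be pictured roughly as below.
This is a sketch under assumptions: the real code lives in assembly in
<filename>vg_syscall.S</filename>, the names
<computeroutput>SaveArea</computeroutput>,
<computeroutput>enter_syscall</computeroutput> and
<computeroutput>leave_syscall</computeroutput> are invented for
illustration, and the depth counter is updated here in the same
functions, whereas the text says it is maintained by
<computeroutput>VG_(wrap_syscall)</computeroutput>.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <string.h>

#define MAX_SYSCALL_DEPTH 2     /* the text describes two sets of save slots */
#define FPU_STATE_BYTES   108   /* FPU state size as given in the text */

typedef struct {
    unsigned long saved_sp;                    /* stand-in for Valgrind's %esp */
    unsigned char saved_fpu[FPU_STATE_BYTES];  /* stand-in for its FPU state   */
} SaveArea;

static SaveArea save_area[MAX_SYSCALL_DEPTH];
static int syscall_depth = 0;   /* models VG_(syscall_depth) */

/* Before handing the CPU to a client syscall: stash our own state in
   the slot for the current depth, then go one level deeper. */
void enter_syscall(unsigned long sp, const unsigned char *fpu)
{
    assert(syscall_depth < MAX_SYSCALL_DEPTH);
    save_area[syscall_depth].saved_sp = sp;
    memcpy(save_area[syscall_depth].saved_fpu, fpu, FPU_STATE_BYTES);
    syscall_depth++;
}

/* After the syscall returns: come back up one level, restore the FPU
   state into `fpu_out` and return the saved stack pointer. */
unsigned long leave_syscall(unsigned char *fpu_out)
{
    assert(syscall_depth > 0);
    syscall_depth--;
    memcpy(fpu_out, save_area[syscall_depth].saved_fpu, FPU_STATE_BYTES);
    return save_area[syscall_depth].saved_sp;
}
```
]]></programlisting>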
				842
				843	</sect2>
				844
				845
				846
				847	<sect2 id="mc-tech-docs.ucode" xreflabel="Introduction to UCode">
				848	<title>Introduction to UCode</title>
				849
				850	<para>UCode lies at the heart of the x86-to-x86 JITter. The
				851	basic premise is that dealing with the x86 instruction set head-on
				852	is just too darn complicated, so we do the traditional
				853	compiler-writer's trick and translate it into a simpler,
				854	easier-to-deal-with form.</para>
				855
				856	<para>In normal operation, translation proceeds through six
				857	stages, coordinated by
				858	<computeroutput>VG_(translate)</computeroutput>:</para>
				859
				860	<orderedlist>
				861	<listitem>
				862	<para>Parsing of an x86 basic block into a sequence of UCode
				863	instructions (<computeroutput>VG_(disBB)</computeroutput>).</para>
				864	</listitem>
				865
				866	<listitem>
				867	<para>UCode optimisation
				868	(<computeroutput>vg_improve</computeroutput>), with the aim
				869	of caching simulated registers in real registers over
				870	multiple simulated instructions, and removing redundant
				871	simulated <computeroutput>%EFLAGS</computeroutput>
				872	saving/restoring.</para>
				873	</listitem>
				874
				875	<listitem>
				876	<para>UCode instrumentation
				877	(<computeroutput>vg_instrument</computeroutput>), which adds
				878	value and address checking code.</para>
				879	</listitem>
				880
				881	<listitem>
				882	<para>Post-instrumentation cleanup
				883	(<computeroutput>vg_cleanup</computeroutput>), removing
				884	redundant value-check computations.</para>
				885	</listitem>
				886
				887	<listitem>
				888	<para>Register allocation
				889	(<computeroutput>vg_do_register_allocation</computeroutput>),
				890	which, note, is done on UCode.</para>
				891	</listitem>
				892
				893	<listitem>
				894	<para>Emission of final instrumented x86 code
				895	(<computeroutput>VG_(emit_code)</computeroutput>).</para>
				896	</listitem>
				897
				898	</orderedlist>
				899
				900	<para>Notice how steps 2, 3, 4 and 5 are simple UCode-to-UCode
				901	transformation passes, all on straight-line blocks of UCode (type
				902	<computeroutput>UCodeBlock</computeroutput>). Steps 2 and 4 are
				903	optimisation passes and can be disabled for debugging purposes,
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	904	with <option>--optimise=no</option> and
				905	<option>--cleanup=no</option> respectively.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	906
				907	<para>Valgrind can also run in a no-instrumentation mode, given
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	908	<option>--instrument=no</option>. This is useful
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	909	for debugging the JITter quickly without having to deal with the
				910	complexity of the instrumentation mechanism too. In this mode,
				911	steps 3 and 4 are omitted.</para>
				912
				913	<para>These flags combine, so that
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	914	<option>--instrument=no</option> together with
				915	<option>--optimise=no</option> means only steps
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	916	1, 5 and 6 are used.
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	917	<option>--single-step=yes</option> causes each
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	918	x86 instruction to be treated as a single basic block. The
				919	translations are terrible but this is sometimes instructive.</para>
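
<para>The way these flags select stages can be summarised in a few
lines of C. This is only a restatement of the rules given above, not
code from <computeroutput>VG_(translate)</computeroutput>; the
<computeroutput>STAGE_</computeroutput> constants and the function
name are made up for the purpose.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>

/* One bit per stage, numbered as in the list above. */
enum {
    STAGE_PARSE      = 1 << 0,   /* 1: x86 to UCode              */
    STAGE_IMPROVE    = 1 << 1,   /* 2: UCode optimisation         */
    STAGE_INSTRUMENT = 1 << 2,   /* 3: instrumentation            */
    STAGE_CLEANUP    = 1 << 3,   /* 4: post-instrumentation cleanup */
    STAGE_REGALLOC   = 1 << 4,   /* 5: register allocation        */
    STAGE_EMIT       = 1 << 5    /* 6: x86 code emission          */
};

/* Return the set of stages that run for a given flag combination.
   Each parameter is true when the matching option is "yes". */
unsigned stages_to_run(bool optimise, bool cleanup, bool instrument)
{
    /* Stages 1, 5 and 6 always run. */
    unsigned s = STAGE_PARSE | STAGE_REGALLOC | STAGE_EMIT;

    if (optimise)
        s |= STAGE_IMPROVE;

    /* Without instrumentation there is nothing to clean up, so
       stages 3 and 4 are dropped together. */
    if (instrument) {
        s |= STAGE_INSTRUMENT;
        if (cleanup)
            s |= STAGE_CLEANUP;
    }
    return s;
}
```
]]></programlisting>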
				920
de	03e0e7c	2005-12-03 23:02:33 +0000	[diff] [blame]	921	<para>The <option>--stop-after=N</option> flag
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	922	switches back to the real CPU after
				923	<computeroutput>N</computeroutput> basic blocks. It also re-JITs
				924	the final basic block executed and prints the debugging info
				925	resulting, so this gives you a way to get a quick snapshot of how
				926	a basic block looks as it passes through the six stages mentioned
				927	above. If you want to see full information for every block
				928	translated (probably not, but still ...) find, in
				929	<computeroutput>VG_(translate)</computeroutput>, the lines</para>
				930	<programlisting><![CDATA[
				931	dis = True;
				932	dis = debugging_translation;]]></programlisting>
				933
				934	<para>and comment out the second line. This will spew out
				935	debugging junk faster than you can possibly imagine.</para>
				936
				937	</sect2>
				938
				939
				940
				941	<sect2 id="mc-tech-docs.tags" xreflabel="UCode operand tags: type 'Tag'">
				942	<title>UCode operand tags: type <computeroutput>Tag</computeroutput></title>
				943
				944	<para>UCode is, more or less, a simple two-address RISC-like
				945	code. In keeping with the x86 AT&T assembly syntax,
				946	generally speaking the first operand is the source operand, and
				947	the second is the destination operand, which is modified when the
				948	uinstr is notionally executed.</para>
				949
				950	<para>UCode instructions have up to three operand fields, each of
				951	which has a corresponding <computeroutput>Tag</computeroutput>
				952	describing it. Possible values for the tag are:</para>
				953
				954	<itemizedlist>
				955
				956	<listitem>
				957	<para><computeroutput>NoValue</computeroutput>: indicates
				958	that the field is not in use.</para>
				959	</listitem>
				960
				961	<listitem>
				962	<para><computeroutput>Lit16</computeroutput>: the field
				963	contains a 16-bit literal.</para>
				964	</listitem>
				965
				966	<listitem>
				967	<para><computeroutput>Literal</computeroutput>: the field
				968	denotes a 32-bit literal, whose value is stored in the
				969	<computeroutput>lit32</computeroutput> field of the uinstr
				970	itself. Since there is only one
				971	<computeroutput>lit32</computeroutput> for the whole uinstr,
				972	only one operand field may contain this tag.</para>
				973	</listitem>
				974
				975	<listitem>
				976	<para><computeroutput>SpillNo</computeroutput>: the field
				977	contains a spill slot number, in the range 0 to 23 inclusive,
				978	denoting one of the spill slots contained inside
				979	<computeroutput>VG_(baseBlock)</computeroutput>. Such tags
				980	only exist after register allocation.</para>
				981	</listitem>
				982
				983	<listitem>
				984	<para><computeroutput>RealReg</computeroutput>: the field
				985	contains a number in the range 0 to 7 denoting an integer x86
				986	("real") register on the host. The number is the Intel
				987	encoding for integer registers. Such tags only exist after
				988	register allocation.</para>
				989	</listitem>
				990
				991	<listitem>
				992	<para><computeroutput>ArchReg</computeroutput>: the field
				993	contains a number in the range 0 to 7 denoting an integer x86
				994	register on the simulated CPU. In reality this means a
				995	reference to one of the first 8 words of
				996	<computeroutput>VG_(baseBlock)</computeroutput>. Such tags
				997	can exist at any point in the translation process.</para>
				998	</listitem>
				999
				1000	<listitem>
				1001	<para>Last, but not least,
				1002	<computeroutput>TempReg</computeroutput>. The field contains
				1003	the number of one of an infinite set of virtual (integer)
				1004	registers. <computeroutput>TempReg</computeroutput>s are used
				1005	everywhere throughout the translation process; you can have
				1006	as many as you want. The register allocator maps as many as
				1007	it can into <computeroutput>RealReg</computeroutput>s and
				1008	turns the rest into
				1009	<computeroutput>SpillNo</computeroutput>s, so
				1010	<computeroutput>TempReg</computeroutput>s should not exist
				1011	after the register allocation phase.</para>
				1012
				1013	<para><computeroutput>TempReg</computeroutput>s are always 32
				1014	bits long, even if the data they hold is logically shorter.
				1015	In that case the upper unused bits are required, and, I
				1016	think, generally assumed, to be zero.
				1017	<computeroutput>TempReg</computeroutput>s holding V bits for
				1018	quantities shorter than 32 bits are expected to have ones in
				1019	the unused places, since a one denotes "undefined".</para>
				1020	</listitem>
				1021
				1022	</itemizedlist>
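
<para>The constraints stated in this list can be written down as a
handful of checks. The tag names are the ones used above, but the
order of the enumerators, the helper functions and their names are
assumptions made for this sketch; the authoritative checks are in the
sources.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Tag names from the text; the numeric values are arbitrary here. */
typedef enum {
    NoValue, Lit16, Literal, SpillNo, RealReg, ArchReg, TempReg
} Tag;

/* May a tag appear before / after register allocation? */
bool tag_ok_in_phase(Tag t, bool after_regalloc)
{
    switch (t) {
    case SpillNo:
    case RealReg: return after_regalloc;    /* created by reg-alloc  */
    case TempReg: return !after_regalloc;   /* consumed by reg-alloc */
    default:      return true;              /* valid at any point    */
    }
}

/* Is the operand field value within the range the tag allows? */
bool operand_in_range(Tag t, uint32_t val)
{
    switch (t) {
    case Lit16:   return val <= 0xFFFFu;
    case SpillNo: return val <= 23;          /* 24 spill slots */
    case RealReg:
    case ArchReg: return val <= 7;           /* 8 integer regs */
    default:      return true;
    }
}

/* V bits for a value of `size` bytes held in a 32-bit TempReg: the
   unused upper bits are set to one, meaning "undefined". */
uint32_t pad_vbits(uint32_t vbits, int size)
{
    assert(size == 1 || size == 2 || size == 4);
    if (size == 1) return vbits | 0xFFFFFF00u;
    if (size == 2) return vbits | 0xFFFF0000u;
    return vbits;
}
```
]]></programlisting>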
				1023
				1024	</sect2>
				1025
				1026
				1027
				1028	<sect2 id="mc-tech-docs.uinstr"
				1029	xreflabel="UCode instructions: type 'UInstr'">
				1030	<title>UCode instructions: type <computeroutput>UInstr</computeroutput></title>
				1031
				1032	<para>UCode was carefully designed to make it possible to do
				1033	register allocation on UCode and then translate the result into
				1034	x86 code without needing any extra registers ... well, that was
				1035	the original plan, anyway. Things have gotten a little more
				1036	complicated since then. In what follows, UCode instructions are
				1037	referred to as uinstrs, to distinguish them from x86
				1038	instructions. Uinstrs of course have uopcodes which are
				1039	(naturally) different from x86 opcodes.</para>
				1040
				1041	<para>A uinstr (type <computeroutput>UInstr</computeroutput>)
				1042	contains various fields, not all of which are used by any one
				1043	uopcode:</para>
				1044
				1045	<itemizedlist>
				1046
				1047	<listitem>
				1048	<para>Three 16-bit operand fields,
				1049	<computeroutput>val1</computeroutput>,
				1050	<computeroutput>val2</computeroutput> and
				1051	<computeroutput>val3</computeroutput>.</para>
				1052	</listitem>
				1053
				1054	<listitem>
				1055	<para>Three tag fields,
				1056	<computeroutput>tag1</computeroutput>,
				1057	<computeroutput>tag2</computeroutput> and
				1058	<computeroutput>tag3</computeroutput>. Each of these has a
				1059	value of type <computeroutput>Tag</computeroutput>, and they
				1060	describe what the <computeroutput>val1</computeroutput>,
				1061	<computeroutput>val2</computeroutput> and
				1062	<computeroutput>val3</computeroutput> fields contain.</para>
				1063	</listitem>
				1064
				1065	<listitem>
				1066	<para>A 32-bit literal field.</para>
				1067	</listitem>
				1068
				1069	<listitem>
				1070	<para>Two <computeroutput>FlagSet</computeroutput>s,
				1071	specifying which x86 condition codes are read and written by
				1072	the uinstr.</para>
				1073	</listitem>
				1074
				1075	<listitem>
				1076	<para>An opcode byte, containing a value of type
				1077	<computeroutput>Opcode</computeroutput>.</para>
				1078	</listitem>
				1079
				1080	<listitem>
				1081	<para>A size field, indicating the data transfer size
				1082	(1/2/4/8/10) in cases where this makes sense, or zero
				1083	otherwise.</para>
				1084	</listitem>
				1085
				1086	<listitem>
				1087	<para>A condition-code field, which, for jumps, holds a value
				1088	of type <computeroutput>Condcode</computeroutput>, indicating
				1089	the condition which applies. The encoding is as it is in the
				1090	x86 insn stream, except we add a 17th value
				1091	<computeroutput>CondAlways</computeroutput> to indicate an
				1092	unconditional transfer.</para>
				1093	</listitem>
				1094
				1095	<listitem>
				1096	<para>Various 1-bit flags, indicating whether this insn
				1097	pertains to an x86 CALL or RET instruction, whether a
				1098	widening is signed or not, etc.</para>
				1099	</listitem>
				1100
				1101	</itemizedlist>
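
<para>Put together, the fields listed above suggest a record along
the following lines. Only <computeroutput>val1</computeroutput> to
<computeroutput>val3</computeroutput>,
<computeroutput>tag1</computeroutput> to
<computeroutput>tag3</computeroutput> and
<computeroutput>lit32</computeroutput> are names taken from the text;
the remaining field names, the field widths and ordering, and the
numeric value of <computeroutput>COND_ALWAYS</computeroutput> (16,
assuming the 16 x86 condition codes are numbered from zero) are
guesses for illustration. The real definition is the
<computeroutput>UInstr</computeroutput> type in the sources.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Tag values, repeated so that this block compiles by itself. */
typedef enum {
    NoValue, Lit16, Literal, SpillNo, RealReg, ArchReg, TempReg
} Tag;

#define COND_ALWAYS 16          /* the "17th" condition value (assumed) */

typedef struct {
    uint32_t lit32;             /* the single 32-bit literal            */
    uint16_t val1, val2, val3;  /* three 16-bit operand fields          */
    uint8_t  opcode;            /* uopcode                              */
    uint8_t  size;              /* 0, 1, 2, 4, 8 or 10                  */
    uint8_t  cond;              /* condition code, for jumps            */
    uint8_t  flags_r, flags_w;  /* condition codes read / written       */
    unsigned tag1 : 4;          /* what val1 contains                   */
    unsigned tag2 : 4;          /* what val2 contains                   */
    unsigned tag3 : 4;          /* what val3 contains                   */
    unsigned signed_widen : 1;  /* example of a 1-bit flag              */
} UInstrSketch;

/* A few of the structural rules stated in the text. */
bool uinstr_sketch_sane(const UInstrSketch *u)
{
    assert(u != 0);
    /* Only one operand may use the shared lit32 field. */
    int literals = (u->tag1 == Literal) + (u->tag2 == Literal)
                 + (u->tag3 == Literal);
    if (literals > 1)
        return false;

    /* Transfer size must be one of the listed values, or zero. */
    switch (u->size) {
    case 0: case 1: case 2: case 4: case 8: case 10: break;
    default: return false;
    }
    return u->cond <= COND_ALWAYS;
}
```
]]></programlisting>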
				1102
				1103	<para>UOpcodes (type <computeroutput>Opcode</computeroutput>) are
				1104	divided into two groups: those necessary merely to express the
				1105	functionality of the x86 code, and extra uopcodes needed to
				1106	express the instrumentation. The former group contains:</para>
				1107
				1108	<itemizedlist>
				1109
				1110	<listitem>
				1111	<para><computeroutput>GET</computeroutput> and
				1112	<computeroutput>PUT</computeroutput>, which move values from
				1113	the simulated CPU's integer registers
				1114	(<computeroutput>ArchReg</computeroutput>s) into
				1115	<computeroutput>TempReg</computeroutput>s, and back.
				1116	<computeroutput>GETF</computeroutput> and
				1117	<computeroutput>PUTF</computeroutput> do the corresponding
				1118	thing for the simulated
				1119	<computeroutput>%EFLAGS</computeroutput>. There are no
				1120	corresponding insns for the FPU register stack, since we
				1121	don't explicitly simulate its registers.</para>
				1122	</listitem>
				1123
				1124	<listitem>
				1125	<para><computeroutput>LOAD</computeroutput> and
				1126	<computeroutput>STORE</computeroutput>, which, in RISC-like
				1127	fashion, are the only uinstrs able to interact with
				1128	memory.</para>
				1129	</listitem>
				1130
				1131	<listitem>
				1132	<para><computeroutput>MOV</computeroutput> and
				1133	<computeroutput>CMOV</computeroutput> allow unconditional and
				1134	conditional moves of values between
				1135	<computeroutput>TempReg</computeroutput>s.</para>
				1136	</listitem>
				1137
				1138	<listitem>
				1139	<para>ALU operations. Again in RISC-like fashion, these only
				1140	operate on <computeroutput>TempReg</computeroutput>s (before
				1141	reg-alloc) or <computeroutput>RealReg</computeroutput>s
				1142	(after reg-alloc). These are:
				1143	<computeroutput>ADD</computeroutput>,
				1144	<computeroutput>ADC</computeroutput>,
				1145	<computeroutput>AND</computeroutput>,
				1146	<computeroutput>OR</computeroutput>,
				1147	<computeroutput>XOR</computeroutput>,
				1148	<computeroutput>SUB</computeroutput>,
				1149	<computeroutput>SBB</computeroutput>,
				1150	<computeroutput>SHL</computeroutput>,
				1151	<computeroutput>SHR</computeroutput>,
				1152	<computeroutput>SAR</computeroutput>,
				1153	<computeroutput>ROL</computeroutput>,
				1154	<computeroutput>ROR</computeroutput>,
				1155	<computeroutput>RCL</computeroutput>,
				1156	<computeroutput>RCR</computeroutput>,
				1157	<computeroutput>NOT</computeroutput>,
				1158	<computeroutput>NEG</computeroutput>,
				1159	<computeroutput>INC</computeroutput>,
				1160	<computeroutput>DEC</computeroutput>,
				1161	<computeroutput>BSWAP</computeroutput>,
				1162	<computeroutput>CC2VAL</computeroutput> and
				1163	<computeroutput>WIDEN</computeroutput>.
				1164	<computeroutput>WIDEN</computeroutput> does signed or
				1165	unsigned value widening.
				1166	<computeroutput>CC2VAL</computeroutput> is used to convert
				1167	condition codes into a value, zero or one. The rest are
				1168	obvious.</para>
				1169
				1170	<para>To allow for more efficient code generation, we bend
				1171	slightly the restriction at the start of the previous para:
				1172	for <computeroutput>ADD</computeroutput>,
				1173	<computeroutput>ADC</computeroutput>,
				1174	<computeroutput>XOR</computeroutput>,
				1175	<computeroutput>SUB</computeroutput> and
				1176	<computeroutput>SBB</computeroutput>, we allow the first
				1177	(source) operand to also be an
				1178	<computeroutput>ArchReg</computeroutput>, that is, one of the
				1179	simulated machine's registers. Also, many of these ALU ops
				1180	allow the source operand to be a literal. See
				1181	<computeroutput>VG_(saneUInstr)</computeroutput> for the
				1182	final word on the allowable forms of uinstrs.</para>
				1183	</listitem>
				1184
				1185	<listitem>
				1186	<para><computeroutput>LEA1</computeroutput> and
				1187	<computeroutput>LEA2</computeroutput> are not strictly
sewardj	3387889	2007-11-17 09:43:25 +0000	[diff] [blame^]	1188	necessary, but facilitate better translations. They
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	1189	record the fancy x86 addressing modes in a direct way, which
				1190	allows those amodes to be emitted back into the final
				1191	instruction stream more or less verbatim.</para>
				1192	</listitem>
				1193
				1194	<listitem>
				1195	<para><computeroutput>CALLM</computeroutput> calls a
				1196	machine-code helper, one of the methods whose address is
				1197	stored at some
				1198	<computeroutput>VG_(baseBlock)</computeroutput> offset.
				1199	<computeroutput>PUSH</computeroutput> and
				1200	<computeroutput>POP</computeroutput> move values between
				1201	<computeroutput>TempReg</computeroutput>s and the real
				1202	(Valgrind's) stack, and
				1203	<computeroutput>CLEAR</computeroutput> removes values from
				1204	the stack. <computeroutput>CALLM_S</computeroutput> and
				1205	<computeroutput>CALLM_E</computeroutput> delimit the
				1206	boundaries of call setups and clearings, for the benefit of
				1207	the instrumentation passes. Getting this right is critical,
				1208	and so <computeroutput>VG_(saneUCodeBlock)</computeroutput>
				1209	makes various checks on the use of these uopcodes.</para>
				1210
				1211	<para>It is important to understand that these uopcodes have
				1212	nothing to do with the x86
				1213	<computeroutput>call</computeroutput>,
				1214	<computeroutput>ret</computeroutput>,
				1215	<computeroutput>push</computeroutput> or
				1216	<computeroutput>pop</computeroutput> instructions, and are
				1217	not used to implement them. Those guys turn into
				1218	combinations of <computeroutput>GET</computeroutput>,
				1219	<computeroutput>PUT</computeroutput>,
				1220	<computeroutput>LOAD</computeroutput>,
				1221	<computeroutput>STORE</computeroutput>,
				1222	<computeroutput>ADD</computeroutput>,
				1223	<computeroutput>SUB</computeroutput>, and
				1224	<computeroutput>JMP</computeroutput>. What these uopcodes
				1225	support is calling of helper functions such as
				1226	<computeroutput>VG_(helper_imul_32_64)</computeroutput>,
				1227	which do stuff which is too difficult or tedious to emit
				1228	inline.</para>
				1229	</listitem>
				1230
				1231	<listitem>
				1232	<para><computeroutput>FPU</computeroutput>,
				1233	<computeroutput>FPU_R</computeroutput> and
				1234	<computeroutput>FPU_W</computeroutput>. Valgrind doesn't
				1235	attempt to simulate the internal state of the FPU at all.
				1236	Consequently it only needs to be able to distinguish FPU ops
				1237	which read and write memory from those that don't, and for
				1238	those which do, it needs to know the effective address and
				1239	data transfer size. This is made easier because the x86 FP
				1240	instruction encoding is very regular, basically consisting of
				1241	16 bits for a non-memory FPU insn and 11 (IIRC) bits + an
				1242	address mode for a memory FPU insn. So our
				1243	<computeroutput>FPU</computeroutput> uinstr carries the 16
				1244	bits in its <computeroutput>val1</computeroutput> field. And
				1245	<computeroutput>FPU_R</computeroutput> and
				1246	<computeroutput>FPU_W</computeroutput> carry 11 bits in that
				1247	field, together with the identity of a
				1248	<computeroutput>TempReg</computeroutput> or (later)
				1249	<computeroutput>RealReg</computeroutput> which contains the
				1250	address.</para>
				1251	</listitem>
				1252
				1253	<listitem>
				1254	<para><computeroutput>JIFZ</computeroutput> is unique, in
				1255	that it allows a control-flow transfer which is not deemed to
				1256	end a basic block. It causes a jump to a literal (original)
				1257	address if the specified argument is zero.</para>
				1258	</listitem>
				1259
				1260	<listitem>
				1261	<para>Finally, <computeroutput>INCEIP</computeroutput>
				1262	advances the simulated <computeroutput>%EIP</computeroutput>
				1263	by the specified literal amount. This supports lazy
				1264	<computeroutput>%EIP</computeroutput> updating, as described
				1265	below.</para>
				1266	</listitem>
				1267
				1268	</itemizedlist>
				1269
				1270	<para>Stages 1 and 2 of the 6-stage translation process mentioned
				1271	above deal purely with these uopcodes, and no others. They are
				1272	sufficient to express pretty much all the x86 32-bit
				1273	protected-mode instruction set, at least everything understood by
				1274	a pre-MMX original Pentium (P54C).</para>
				1275
				1276	<para>Stages 3, 4, 5 and 6 also deal with the following extra
				1277	"instrumentation" uopcodes. They are used to express all the
				1278	definedness-tracking and -checking machinery which Valgrind implements.
				1279	In later sections we show how to create checking code for each of
				1280	the uopcodes above. Note that these instrumentation uopcodes,
				1281	although some appear complicated, have been carefully chosen
				1282	so that efficient x86 code can be generated for them. GNU
				1283	superopt v2.5 did a great job helping out here. Anyways, the
				1284	uopcodes are as follows:</para>
				1285
				1286	<itemizedlist>
				1287
				1288	<listitem>
				1289	<para><computeroutput>GETV</computeroutput> and
				1290	<computeroutput>PUTV</computeroutput> are analogues to
				1291	<computeroutput>GET</computeroutput> and
				1292	<computeroutput>PUT</computeroutput> above. They are
				1293	identical except that they move the V bits for the specified
				1294	values back and forth to
				1295	<computeroutput>TempRegs</computeroutput>, rather than moving
				1296	the values themselves.</para>
				1297	</listitem>
				1298
				1299	<listitem>
				1300	<para>Similarly, <computeroutput>LOADV</computeroutput> and
				1301	<computeroutput>STOREV</computeroutput> read and write V bits
				1302	from the synthesised shadow memory that Valgrind maintains.
				1303	In fact they do more than that, since they also do
				1304	address-validity checks, and emit complaints if the
sewardj	3387889	2007-11-17 09:43:25 +0000	[diff] [blame^]	1305	read/written addresses are unaddressable.</para>
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	1306	</listitem>
				1307
				1308	<listitem>
				1309	<para><computeroutput>TESTV</computeroutput>, whose
				1310	parameters are a <computeroutput>TempReg</computeroutput> and
				1311	a size, tests the V bits in the
				1312	<computeroutput>TempReg</computeroutput>, at the specified
				1313	operation size (0/1/2/4 byte) and emits an error if any of
				1314	them indicate undefinedness. This is the only uopcode
				1315	capable of doing such tests.</para>
				1316	</listitem>
				1317
				1318	<listitem>
				1319	<para><computeroutput>SETV</computeroutput>, whose parameters
				1320	are also <computeroutput>TempReg</computeroutput> and a size,
				1321	makes the V bits in the
				1322	<computeroutput>TempReg</computeroutput> indicate
				1323	definedness, at the specified operation size. This is
				1324	usually used to generate the correct V bits for a literal
				1325	value, which is of course fully defined.</para>
				1326	</listitem>
				1327
				1328	<listitem>
				1329	<para><computeroutput>GETVF</computeroutput> and
				1330	<computeroutput>PUTVF</computeroutput> are analogues to
				1331	<computeroutput>GETF</computeroutput> and
				1332	<computeroutput>PUTF</computeroutput>. They move the single
				1333	V bit used to model definedness of
				1334	<computeroutput>%EFLAGS</computeroutput> between its home in
				1335	<computeroutput>VG_(baseBlock)</computeroutput> and the
				1336	specified <computeroutput>TempReg</computeroutput>.</para>
				1337	</listitem>
				1338
				1339	<listitem>
				1340	<para><computeroutput>TAG1</computeroutput> denotes one of a
				1341	family of unary operations on
				1342	<computeroutput>TempReg</computeroutput>s containing V bits.
				1343	Similarly, <computeroutput>TAG2</computeroutput> denotes one
				1344	in a family of binary operations on V bits.</para>
				1345	</listitem>
				1346
				1347	</itemizedlist>
				1348
				1349
				1350	<para>These 10 uopcodes are sufficient to express Valgrind's
				1351	entire definedness-checking semantics. In fact most of the
				1352	interesting magic is done by the
				1353	<computeroutput>TAG1</computeroutput> and
				1354	<computeroutput>TAG2</computeroutput> suboperations.</para>
				1355
				1356	<para>First, however, I need to explain about V-vector operation
				1357	sizes. There are 4 sizes: 1, 2 and 4, which operate on groups of
				1358	8, 16 and 32 V bits at a time, supporting the usual 1, 2 and 4
				1359	byte x86 operations. However there is also the mysterious size
				1360	0, which really means a single V bit. Single V bits are used in
				1361	various circumstances; in particular, the definedness of
				1362	<computeroutput>%EFLAGS</computeroutput> is modelled with a
				1363	single V bit. Now might be a good time to also point out that
				1364	for V bits, 1 means "undefined" and 0 means "defined".
				1365	Similarly, for A bits, 1 means "invalid address" and 0 means
				1366	"valid address". This seems counterintuitive (and so it is), but
				1367	testing against zero on x86s saves instructions compared to
				1368	testing against all 1s, because many ALU operations set the Z
				1369	flag for free, so to speak.</para>
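<para>To make these conventions concrete, here is a small C sketch.
It is not Valgrind's code; the names
<computeroutput>toy_vmask_for_size</computeroutput> and
<computeroutput>toy_any_undefined</computeroutput> are invented for this
illustration, and it assumes the V bits of interest live in the low
bits of a 32-bit word. It shows how an operation size selects a group
of V bits, and why the "1 means undefined" encoding reduces a
definedness check to a test against zero.</para>
<programlisting><![CDATA[
#include <stdint.h>

/* Hypothetical helper: mask selecting the V bits covered by an
   operation size.  Size 0 is the single-V-bit case. */
static uint32_t toy_vmask_for_size(int size)
{
    switch (size) {
    case 0:  return 0x00000001u;   /* one V bit  */
    case 1:  return 0x000000FFu;   /* 8 V bits   */
    case 2:  return 0x0000FFFFu;   /* 16 V bits  */
    default: return 0xFFFFFFFFu;   /* 32 V bits  */
    }
}

/* Nonzero iff any selected V bit says "undefined" (1).  Because
   "defined" is 0, this is just a masked test against zero. */
static int toy_any_undefined(uint32_t vbits, int size)
{
    return (vbits & toy_vmask_for_size(size)) != 0;
}
]]></programlisting>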
				1370
				1371	<para>With that in mind, the tag ops are:</para>
				1372
				1373	<itemizedlist>
				1374
				1375	<listitem>
				1376	<formalpara>
				1377	<title>(UNARY) Pessimising casts:</title>
				1378	<para><computeroutput>VgT_PCast40</computeroutput>,
				1379	<computeroutput>VgT_PCast20</computeroutput>,
				1380	<computeroutput>VgT_PCast10</computeroutput>,
				1381	<computeroutput>VgT_PCast01</computeroutput>,
				1382	<computeroutput>VgT_PCast02</computeroutput> and
				1383	<computeroutput>VgT_PCast04</computeroutput>. A "pessimising
				1384	cast" takes a V-bit vector at one size, and creates a new one
				1385	at another size, pessimised in the sense that if any of the
				1386	bits in the source vector indicate undefinedness, then all
				1387	the bits in the result indicate undefinedness. In this case
				1388	the casts are all to or from a single V bit, so for example
				1389	<computeroutput>VgT_PCast40</computeroutput> is a pessimising
				1390	cast from 32 bits to 1, whereas
				1391	<computeroutput>VgT_PCast04</computeroutput> simply copies
				1392	the single source V bit into all 32 bit positions in the
				1393	result. Surprisingly, these ops can all be implemented very
				1394	efficiently.</para>
				1395	</formalpara>
				1396
				1397	<para>There are also the pessimising casts
				1398	<computeroutput>VgT_PCast14</computeroutput>, from 8 bits to
				1399	32, <computeroutput>VgT_PCast12</computeroutput>, from 8 bits
				1400	to 16, and <computeroutput>VgT_PCast11</computeroutput>, from
				1401	8 bits to 8. This last one seems nonsensical, but in fact it
				1402	isn't a no-op because, as mentioned above, any undefined (1)
				1403	bits in the source infect the entire result.</para>
				1404	</listitem>
				1405
				1406	<listitem>
				1407	<formalpara>
				1408	<title>(UNARY) Propagating undefinedness upwards in a
				1409	word:</title>
				1410	<para><computeroutput>VgT_Left4</computeroutput>,
				1411	<computeroutput>VgT_Left2</computeroutput> and
				1412	<computeroutput>VgT_Left1</computeroutput>. These are used
				1413	to simulate the worst-case effects of carry propagation in
				1414	adds and subtracts. They return a V vector identical to the
				1415	original, except that if the original contained any undefined
				1416	bits, then the lowest such bit and all bits above it are marked
				1417	as undefined too. Hence the Left bit in the names.</para></formalpara>
				1418	</listitem>
				1419
				1420	<listitem>
				1421	<formalpara>
				1422	<title>(UNARY) Signed and unsigned value widening:</title>
				1423	<para><computeroutput>VgT_SWiden14</computeroutput>,
				1424	<computeroutput>VgT_SWiden24</computeroutput>,
				1425	<computeroutput>VgT_SWiden12</computeroutput>,
				1426	<computeroutput>VgT_ZWiden14</computeroutput>,
				1427	<computeroutput>VgT_ZWiden24</computeroutput> and
				1428	<computeroutput>VgT_ZWiden12</computeroutput>. These mimic
				1429	the definedness effects of standard signed and unsigned
				1430	integer widening. Unsigned widening creates zero bits in the
				1431	new positions, so
				1432	<computeroutput>VgT_ZWiden*</computeroutput> accordingly
				1433	mark those parts of their argument as defined. Signed
				1434	widening copies the sign bit into the new positions, so
				1435	<computeroutput>VgT_SWiden*</computeroutput> copies the
				1436	definedness of the sign bit into the new positions. Because
				1437	1 means undefined and 0 means defined, these operations can
				1438	(fascinatingly) be done by the same operations which they
				1439	mimic. Go figure.</para>
				1440	</formalpara>
				1441	</listitem>
				1442
				1443	<listitem>
				1444	<formalpara>
				1445	<title>(BINARY) Undefined-if-either-Undefined,
				1446	Defined-if-either-Defined:</title>
				1447	<para><computeroutput>VgT_UifU4</computeroutput>,
				1448	<computeroutput>VgT_UifU2</computeroutput>,
				1449	<computeroutput>VgT_UifU1</computeroutput>,
				1450	<computeroutput>VgT_UifU0</computeroutput>,
				1451	<computeroutput>VgT_DifD4</computeroutput>,
				1452	<computeroutput>VgT_DifD2</computeroutput>,
				1453	<computeroutput>VgT_DifD1</computeroutput>. These do simple
				1454	bitwise operations on pairs of V-bit vectors, with
				1455	<computeroutput>UifU</computeroutput> giving undefined if
				1456	either arg bit is undefined, and
				1457	<computeroutput>DifD</computeroutput> giving defined if
				1458	either arg bit is defined. Abstract interpretation junkies,
				1459	if any make it this far, may like to think of them as meets
				1460	and joins (or is it joins and meets) in the definedness
				1461	lattices.</para>
				1462	</formalpara>
				1463	</listitem>
				1464
				1465	<listitem>
				1466	<formalpara>
				1467	<title>(BINARY; one value, one V bits) Generate argument
				1468	improvement terms for AND and OR</title>
				1469	<para><computeroutput>VgT_ImproveAND4_TQ</computeroutput>,
				1470	<computeroutput>VgT_ImproveAND2_TQ</computeroutput>,
				1471	<computeroutput>VgT_ImproveAND1_TQ</computeroutput>,
				1472	<computeroutput>VgT_ImproveOR4_TQ</computeroutput>,
				1473	<computeroutput>VgT_ImproveOR2_TQ</computeroutput>,
				1474	<computeroutput>VgT_ImproveOR1_TQ</computeroutput>. These
				1475	help out with AND and OR operations. AND and OR have the
				1476	inconvenient property that the definedness of the result
				1477	depends on the actual values of the arguments as well as
				1478	their definedness. At the bit level:</para></formalpara>
				1479	<programlisting><![CDATA[
				1480	1 AND undefined = undefined, but
				1481	0 AND undefined = 0, and
				1482	similarly
				1483	0 OR undefined = undefined, but
				1484	1 OR undefined = 1.]]></programlisting>
				1485
				1486	<para>It turns out that gcc (quite legitimately) generates
				1487	code which relies on this fact, so we have to model it
				1488	properly in order to avoid flooding users with spurious value
				1489	errors. The ultimate definedness result of AND and OR is
				1490	calculated using <computeroutput>UifU</computeroutput> on the
				1491	definedness of the arguments, but we also
				1492	<computeroutput>DifD</computeroutput> in some "improvement"
				1493	terms which take into account the above phenomena.</para>
				1494
				1495	<para><computeroutput>ImproveAND</computeroutput> takes as
				1496	its arguments the actual value of an argument to AND
				1497	(the T) and the definedness of that argument (the Q), and
				1498	returns a V-bit vector which is defined (0) for bits which
				1499	have value 0 and are defined; this, when
				1500	<computeroutput>DifD</computeroutput>'d into the final result,
				1501	causes those bits to be defined even if the corresponding bit
				1502	in the other argument is undefined.</para>
				1503
				1504	<para>The <computeroutput>ImproveOR</computeroutput> ops do
				1505	the dual thing for OR arguments. Note that XOR does not have
				1506	this property that one argument can make the other
				1507	irrelevant, so there is no need for such complexity for
				1508	XOR.</para>
				1509	</listitem>
				1510
				1511	</itemizedlist>
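<para>The unary and simple binary tag ops described in the list above
can be sketched in a few lines of C. This is only an illustration of
the semantics as described, at the 32-bit size (plus one 8-to-32
widening); the function names are invented here and are not Valgrind's,
and the real implementation emits x86 code rather than calling C
functions.</para>
<programlisting><![CDATA[
#include <stdint.h>

/* Convention from the text: 1 = undefined, 0 = defined. */

/* Pessimising cast 32 -> 1 (cf. VgT_PCast40): any undefined bit
   makes the single result bit undefined. */
static uint32_t toy_pcast40(uint32_t v) { return v != 0 ? 1u : 0u; }

/* Pessimising cast 1 -> 32 (cf. VgT_PCast04): smear one V bit
   across all 32 positions. */
static uint32_t toy_pcast04(uint32_t v)
{
    return (v & 1u) ? 0xFFFFFFFFu : 0u;
}

/* cf. VgT_Left4: the lowest undefined bit and everything above it
   becomes undefined, modelling worst-case carry propagation. */
static uint32_t toy_left4(uint32_t v) { return v | (0u - v); }

/* cf. VgT_SWiden14: copy the definedness of bit 7 upwards. */
static uint32_t toy_swiden14(uint32_t v)
{
    return (v & 0x80u) ? (v | 0xFFFFFF00u) : (v & 0xFFu);
}

/* cf. VgT_ZWiden14: the new upper bits are zeros, hence defined. */
static uint32_t toy_zwiden14(uint32_t v) { return v & 0xFFu; }

/* cf. VgT_UifU4: undefined if either arg bit is undefined. */
static uint32_t toy_uifu4(uint32_t a, uint32_t b) { return a | b; }

/* cf. VgT_DifD4: defined if either arg bit is defined. */
static uint32_t toy_difd4(uint32_t a, uint32_t b) { return a & b; }
]]></programlisting>
<para>Note how the widening sketches are literally sign- and
zero-extension applied to the V bits, which is the coincidence remarked
on above.</para>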
				1512
				1513	<para>That's all the tag ops. If you stare at this long enough,
				1514	and then run Valgrind and stare at the pre- and post-instrumented
				1515	ucode, it should be fairly obvious how the instrumentation
				1516	machinery hangs together.</para>
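<para>As a worked example of how the improvement terms combine with
<computeroutput>UifU</computeroutput> and
<computeroutput>DifD</computeroutput>, the following C sketch computes
the V bits of a 32-bit AND or OR result from the values (T) and V bits
(Q) of both arguments. The function names are made up, and the bit
formulas are my reading of the description above rather than a copy of
the generated code.</para>
<programlisting><![CDATA[
#include <stdint.h>

/* Defined (0) exactly where the argument bit is 0 and defined. */
static uint32_t toy_improve_and4(uint32_t t, uint32_t q)
{
    return t | q;
}

/* Dual for OR: defined (0) exactly where the bit is 1 and defined. */
static uint32_t toy_improve_or4(uint32_t t, uint32_t q)
{
    return ~t | q;
}

/* V bits of (ta AND tb): UifU of the argument V bits, then DifD
   with one improvement term per argument. */
static uint32_t toy_vbits_of_and4(uint32_t ta, uint32_t qa,
                                  uint32_t tb, uint32_t qb)
{
    uint32_t naive = qa | qb;                        /* UifU */
    return naive & toy_improve_and4(ta, qa)          /* DifD */
                 & toy_improve_and4(tb, qb);         /* DifD */
}

/* V bits of (ta OR tb), built the same way. */
static uint32_t toy_vbits_of_or4(uint32_t ta, uint32_t qa,
                                 uint32_t tb, uint32_t qb)
{
    uint32_t naive = qa | qb;
    return naive & toy_improve_or4(ta, qa)
                 & toy_improve_or4(tb, qb);
}
]]></programlisting>
<para>For instance, ANDing the fully defined value
<computeroutput>0xF0</computeroutput> with a completely undefined word
leaves only bits 4 to 7 undefined, since every other result bit is
forced to zero by the defined argument.</para>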
				1517
				1518	<para>One point, if you do this: in order to make it easy to
				1519	differentiate <computeroutput>TempReg</computeroutput>s carrying
				1520	values from <computeroutput>TempReg</computeroutput>s carrying V
				1521	bit vectors, Valgrind prints the former as (for example)
				1522	<computeroutput>t28</computeroutput> and the latter as
				1523	<computeroutput>q28</computeroutput>; the fact that they carry
				1524	the same number serves to indicate their relationship. This is
				1525	purely for the convenience of the human reader; the register
				1526	allocator and code generator don't regard them as
				1527	different.</para>
				1528
				1529	</sect2>
				1530
				1531
				1532
de	ccde45e	2005-06-12 10:23:23 +0000	[diff] [blame]	1533	<sect2 id="mc-tech-docs.trans" xreflabel="Translation into UCode">
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	1534	<title>Translation into UCode</title>
				1535
				1536	<para><computeroutput>VG_(disBB)</computeroutput> allocates a new
				1537	<computeroutput>UCodeBlock</computeroutput> and then uses
				1538	<computeroutput>disInstr</computeroutput> to translate x86
				1539	instructions one at a time into UCode, dumping the result in the
				1540	<computeroutput>UCodeBlock</computeroutput>. This goes on until
				1541	a control-flow transfer instruction is encountered.</para>
				1542
				1543	<para>Despite the large size of
				1544	<filename>vg_to_ucode.c</filename>, this translation is really
				1545	very simple. Each x86 instruction is translated entirely
				1546	independently of its neighbours, merrily allocating new
				1547	<computeroutput>TempReg</computeroutput>s as it goes. The idea
				1548	is to have a simple translator -- in reality, no more than a
				1549	macro-expander -- and the resulting bad UCode translation is
				1550	cleaned up by the UCode optimisation phase which follows. To
				1551	give you an idea of some x86 instructions and their translations
				1552	(this is a complete basic block, as Valgrind sees it):</para>
				1553	<programlisting><![CDATA[
				1554	0x40435A50: incl %edx
				1555	0: GETL %EDX, t0
				1556	1: INCL t0 (-wOSZAP)
				1557	2: PUTL t0, %EDX
				1558
				1559	0x40435A51: movsbl (%edx),%eax
				1560	3: GETL %EDX, t2
				1561	4: LDB (t2), t2
				1562	5: WIDENL_Bs t2
				1563	6: PUTL t2, %EAX
				1564
				1565	0x40435A54: testb $0x20, 1(%ecx,%eax,2)
				1566	7: GETL %EAX, t6
				1567	8: GETL %ECX, t8
				1568	9: LEA2L 1(t8,t6,2), t4
				1569	10: LDB (t4), t10
				1570	11: MOVB $0x20, t12
				1571	12: ANDB t12, t10 (-wOSZACP)
				1572	13: INCEIPo $9
				1573
				1574	0x40435A59: jnz-8 0x40435A50
				1575	14: Jnzo $0x40435A50 (-rOSZACP)
				1576	15: JMPo $0x40435A5B]]></programlisting>
				1577
				1578	<para>Notice how the block always ends with an unconditional jump
				1579	to the next block. This is a bit unnecessary, but makes many
				1580	things simpler.</para>
				1581
				1582	<para>Most x86 instructions turn into sequences of
				1583	<computeroutput>GET</computeroutput>,
				1584	<computeroutput>PUT</computeroutput>,
				1585	<computeroutput>LEA1</computeroutput>,
				1586	<computeroutput>LEA2</computeroutput>,
				1587	<computeroutput>LOAD</computeroutput> and
				1588	<computeroutput>STORE</computeroutput>. Some complicated ones
				1589	however rely on calling helper bits of code in
				1590	<filename>vg_helpers.S</filename>. The ucode instructions
				1591	<computeroutput>PUSH</computeroutput>,
				1592	<computeroutput>POP</computeroutput>,
				1593	<computeroutput>CALLM</computeroutput>,
				1594	<computeroutput>CALLM_S</computeroutput> and
				1595	<computeroutput>CALLM_E</computeroutput> support this. The
				1596	calling convention is somewhat ad-hoc and is not the C calling
				1597	convention. The helper routines must save all integer registers,
				1598	and the flags, that they use. Args are passed on the stack
				1599	underneath the return address, as usual, and if result(s) are to
				1600	be returned, it (they) are either placed in dummy arg slots
				1601	created by the ucode <computeroutput>PUSH</computeroutput>
				1602	sequence, or just overwrite the incoming args.</para>
				1603
				1604	<para>In order that the instrumentation mechanism can handle
				1605	calls to these helpers,
				1606	<computeroutput>VG_(saneUCodeBlock)</computeroutput> enforces the
				1607	following restrictions on calls to helpers:</para>
				1608
				1609	<itemizedlist>
				1610
				1611	<listitem>
				1612	<para>Each <computeroutput>CALLM</computeroutput> uinstr must
				1613	be bracketed by a preceding
				1614	<computeroutput>CALLM_S</computeroutput> marker (dummy
				1615	uinstr) and a trailing
				1616	<computeroutput>CALLM_E</computeroutput> marker. These
				1617	markers are used by the instrumentation mechanism later to
				1618	establish the boundaries of the
				1619	<computeroutput>PUSH</computeroutput>,
				1620	<computeroutput>POP</computeroutput> and
				1621	<computeroutput>CLEAR</computeroutput> sequences for the
				1622	call.</para>
				1623	</listitem>
				1624
				1625	<listitem>
				1626	<para><computeroutput>PUSH</computeroutput>,
				1627	<computeroutput>POP</computeroutput> and
				1628	<computeroutput>CLEAR</computeroutput> may only appear inside
				1629	sections bracketed by
				1630	<computeroutput>CALLM_S</computeroutput> and
				1631	<computeroutput>CALLM_E</computeroutput>, and nowhere else.</para>
				1632	</listitem>
				1633
				1634	<listitem>
				1635	<para>In any such bracketed section, no two
				1636	<computeroutput>PUSH</computeroutput> insns may push the same
				1637	<computeroutput>TempReg</computeroutput>. Dually, no two
				1638	<computeroutput>POP</computeroutput>s may pop the same
				1639	<computeroutput>TempReg</computeroutput>.</para>
				1640	</listitem>
				1641
				1642	<listitem>
				1643	<para>Finally, although this is not checked, args should be
				1644	removed from the stack with
				1645	<computeroutput>CLEAR</computeroutput>, rather than
				1646	<computeroutput>POP</computeroutput>s into a
				1647	<computeroutput>TempReg</computeroutput> which is not
				1648	subsequently used. This is because the instrumentation
				1649	mechanism assumes that all values
				1650	<computeroutput>POP</computeroutput>ped from the stack are
				1651	actually used.</para>
				1652	</listitem>
				1653
				1654	</itemizedlist>
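<para>The checked restrictions above amount to a simple linear scan.
The following C sketch shows one way such a check could look on a toy
instruction array. The types <computeroutput>ToyOp</computeroutput> and
<computeroutput>ToyUInstr</computeroutput>, the function
<computeroutput>toy_sane_calls</computeroutput> and the argument limit
are all invented for illustration; the real
<computeroutput>VG_(saneUCodeBlock)</computeroutput> checks more than
this.</para>
<programlisting><![CDATA[
#include <stddef.h>

typedef enum {
    OP_OTHER, OP_CALLM_S, OP_CALLM_E, OP_CALLM,
    OP_PUSH, OP_POP, OP_CLEAR
} ToyOp;

typedef struct { ToyOp op; int tempreg; } ToyUInstr;

#define TOY_MAX_ARGS 16   /* arbitrary limit for this sketch */

static int toy_seen(const int *set, size_t n, int t)
{
    for (size_t j = 0; j < n; j++)
        if (set[j] == t) return 1;
    return 0;
}

/* Returns 1 if the block obeys the bracketing rules, else 0. */
static int toy_sane_calls(const ToyUInstr *u, size_t n)
{
    int inside = 0;                       /* between CALLM_S and CALLM_E? */
    int pushed[TOY_MAX_ARGS], popped[TOY_MAX_ARGS];
    size_t n_pushed = 0, n_popped = 0;

    for (size_t i = 0; i < n; i++) {
        switch (u[i].op) {
        case OP_CALLM_S:
            if (inside) return 0;         /* no nesting */
            inside = 1;
            n_pushed = n_popped = 0;
            break;
        case OP_CALLM_E:
            if (!inside) return 0;
            inside = 0;
            break;
        case OP_CALLM:
        case OP_CLEAR:
            if (!inside) return 0;        /* only inside a bracket */
            break;
        case OP_PUSH:
            if (!inside || n_pushed == TOY_MAX_ARGS) return 0;
            if (toy_seen(pushed, n_pushed, u[i].tempreg)) return 0;
            pushed[n_pushed++] = u[i].tempreg;
            break;
        case OP_POP:
            if (!inside || n_popped == TOY_MAX_ARGS) return 0;
            if (toy_seen(popped, n_popped, u[i].tempreg)) return 0;
            popped[n_popped++] = u[i].tempreg;
            break;
        case OP_OTHER:
            break;
        }
    }
    return !inside;                       /* every CALLM_S must be closed */
}
]]></programlisting>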
				1655
				1656	<para>Some of the translations may appear to have redundant
				1657	<computeroutput>TempReg</computeroutput>-to-<computeroutput>TempReg</computeroutput>
				1658	moves. This helps the next phase, UCode optimisation, to
				1659	generate better code.</para>
				1660
				1661	</sect2>
				1662
				1663
				1664
				1665	<sect2 id="mc-tech-docs.optim" xreflabel="UCode optimisation">
				1666	<title>UCode optimisation</title>
				1667
				1668	<para>UCode is then subjected to an improvement pass
				1669	(<computeroutput>vg_improve()</computeroutput>), which blurs the
				1670	boundaries between the translations of the original x86
				1671	instructions. It's pretty straightforward. Three
				1672	transformations are done:</para>
				1673
				1674	<itemizedlist>
				1675
				1676	<listitem>
				1677	<para>Redundant <computeroutput>GET</computeroutput>
				1678	elimination. Actually, more general than that -- eliminates
				1679	redundant fetches of ArchRegs. In our running example,
				1680	uinstr 3 <computeroutput>GET</computeroutput>s
				1681	<computeroutput>%EDX</computeroutput> into
				1682	<computeroutput>t2</computeroutput> despite the fact that, by
				1683	looking at the previous uinstr, it is already in
				1684	<computeroutput>t0</computeroutput>. The
				1685	<computeroutput>GET</computeroutput> is therefore removed,
				1686	and <computeroutput>t2</computeroutput> renamed to
				1687	<computeroutput>t0</computeroutput>. Assuming
				1688	<computeroutput>t0</computeroutput> is allocated to a host
				1689	register, it means the simulated
				1690	<computeroutput>%EDX</computeroutput> will exist in a host
				1691	CPU register for more than one simulated x86 instruction,
				1692	which seems to me to be a highly desirable property.</para>
				1693
				1694	<para>There is some mucking around to do with subregisters;
				1695	<computeroutput>%AL</computeroutput> vs
				1696	<computeroutput>%AH</computeroutput> vs
				1697	<computeroutput>%AX</computeroutput> vs
				1698	<computeroutput>%EAX</computeroutput> etc. I can't remember
				1699	how it works, but in general we are very conservative, and
				1700	these tend to invalidate the caching.</para>
				1701	</listitem>
				1702
				1703	<listitem>
				1704	<para>Redundant <computeroutput>PUT</computeroutput>
				1705	elimination. This annuls
				1706	<computeroutput>PUT</computeroutput>s of values back to
				1707	simulated CPU registers if a later
				1708	<computeroutput>PUT</computeroutput> would overwrite the
				1709	earlier <computeroutput>PUT</computeroutput> value, and there
				1710	are no intervening reads of the simulated register
				1711	(<computeroutput>ArchReg</computeroutput>).</para>
				1712
				1713	<para>As before, we are paranoid when faced with subregister
				1714	references. Also, <computeroutput>PUT</computeroutput>s of
				1715	<computeroutput>%ESP</computeroutput> are never annulled,
				1716	because it is vital the instrumenter always has an up-to-date
				1717	<computeroutput>%ESP</computeroutput> value available, since
				1718	<computeroutput>%ESP</computeroutput> changes affect
sewardj	3387889	2007-11-17 09:43:25 +0000	[diff] [blame^]	1719	addressability of the memory around the simulated stack
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	1720	pointer.</para>
				1721
				1722	<para>The implication of the above paragraph is that the
				1723	simulated machine's registers are only lazily updated once
				1724	the above two optimisation phases have run, with the
				1725	exception of <computeroutput>%ESP</computeroutput>.
				1726	<computeroutput>TempReg</computeroutput>s go dead at the end
				1727	of every basic block, from which it is inferable that any
				1728	<computeroutput>TempReg</computeroutput> caching a simulated
				1729	CPU reg is flushed (back into the relevant
				1730	<computeroutput>VG_(baseBlock)</computeroutput> slot) at the
				1731	end of every basic block. The further implication is that
				1732	the simulated registers are only up-to-date in between
				1733	basic blocks, and not at arbitrary points inside basic
				1734	blocks. And the consequence of that is that we can only
				1735	deliver signals to the client in between basic blocks. None
				1736	of this seems any problem in practice.</para>
				1737	</listitem>
				1738
				1739	<listitem>
				1740	<para>Finally there is a simple def-use thing for condition
				1741	codes. If an earlier uinstr writes the condition codes, and
				1742	the next uinsn along which actually cares about the condition
				1743	codes writes the same or larger set of them, but does not
				1744	read any, the earlier uinsn is marked as not writing any
				1745	condition codes. This saves a lot of redundant cond-code
				1746	saving and restoring.</para>
				1747	</listitem>
				1748
				1749	</itemizedlist>
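<para>The condition-code rule in the last item can be sketched as
follows. This is a toy model, not the real pass: each uinstr is reduced
to a pair of flag sets (read and written), and the names
<computeroutput>ToyFlags</computeroutput> and
<computeroutput>toy_annul_flag_writes</computeroutput> are invented
here.</para>
<programlisting><![CDATA[
#include <stddef.h>

enum { FL_O = 1, FL_S = 2, FL_Z = 4, FL_A = 8, FL_C = 16, FL_P = 32 };

typedef struct {
    unsigned reads;    /* condition codes this uinstr reads  */
    unsigned writes;   /* condition codes this uinstr writes */
} ToyFlags;

/* For each flag-writing uinstr, find the next uinstr that touches
   the flags at all.  If that one reads none and overwrites at least
   the same set, the earlier write is dead and is annulled. */
static void toy_annul_flag_writes(ToyFlags *u, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (u[i].writes == 0) continue;
        for (size_t j = i + 1; j < n; j++) {
            if (u[j].reads == 0 && u[j].writes == 0)
                continue;                  /* does not care about flags */
            if (u[j].reads == 0 &&
                (u[j].writes & u[i].writes) == u[i].writes)
                u[i].writes = 0;           /* fully overwritten, unread */
            break;                         /* only the next user matters */
        }
    }
}
]]></programlisting>
<para>A write with no later flag user in the block is left alone in
this sketch, on the assumption that the flags may be live on exit from
the basic block.</para>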
				1750
				1751	<para>The effect of these transformations on our short block is
				1752	rather unexciting, and shown below. On longer basic blocks they
				1753	can dramatically improve code quality.</para>
				1754
				1755	<programlisting><![CDATA[
				1756	at 3: delete GET, rename t2 to t0 in (4 .. 6)
				1757	at 7: delete GET, rename t6 to t0 in (8 .. 9)
				1758	at 1: annul flag write OSZAP due to later OSZACP
				1759
				1760	Improved code:
				1761	0: GETL %EDX, t0
				1762	1: INCL t0
				1763	2: PUTL t0, %EDX
				1764	4: LDB (t0), t0
				1765	5: WIDENL_Bs t0
				1766	6: PUTL t0, %EAX
				1767	8: GETL %ECX, t8
				1768	9: LEA2L 1(t8,t0,2), t4
				1769	10: LDB (t4), t10
				1770	11: MOVB $0x20, t12
				1771	12: ANDB t12, t10 (-wOSZACP)
				1772	13: INCEIPo $9
				1773	14: Jnzo $0x40435A50 (-rOSZACP)
				1774	15: JMPo $0x40435A5B]]></programlisting>
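<para>The two <computeroutput>GET</computeroutput> deletions reported
in the listing above can be reproduced with a very small model of the
pass. The sketch below is an assumption-laden toy, not
<computeroutput>vg_improve()</computeroutput>: it ignores operand sizes
and subregisters entirely, it invents the types
<computeroutput>ToyKind</computeroutput> and
<computeroutput>ToyIns</computeroutput>, and it relies on each
<computeroutput>TempReg</computeroutput> being used only within the
translation of one original instruction, which is what makes the
renaming safe.</para>
<programlisting><![CDATA[
#include <stddef.h>

#define TOY_N_AREGS 8            /* number of simulated registers */

typedef enum { TOY_GET, TOY_PUT, TOY_OP, TOY_NOP } ToyKind;

typedef struct {
    ToyKind kind;
    int areg;         /* simulated register for GET/PUT, else ignored */
    int src1, src2;   /* temps read, or -1                            */
    int dst;          /* temp written, or -1                          */
} ToyIns;

/* Rename temp old_t to new_t in ins[from .. n-1]. */
static void toy_rename(ToyIns *ins, size_t from, size_t n,
                       int old_t, int new_t)
{
    for (size_t j = from; j < n; j++) {
        if (ins[j].src1 == old_t) ins[j].src1 = new_t;
        if (ins[j].src2 == old_t) ins[j].src2 = new_t;
        if (ins[j].dst  == old_t) ins[j].dst  = new_t;
    }
}

static void toy_elim_redundant_gets(ToyIns *ins, size_t n)
{
    int cached[TOY_N_AREGS];     /* temp holding each register, or -1 */
    for (int r = 0; r < TOY_N_AREGS; r++) cached[r] = -1;

    for (size_t i = 0; i < n; i++) {
        ToyIns *p = &ins[i];

        /* Register already lives in a temp: delete the GET and make
           later uinstrs use that temp instead. */
        if (p->kind == TOY_GET && cached[p->areg] >= 0) {
            int keep = cached[p->areg];
            int gone = p->dst;
            p->kind = TOY_NOP;
            p->dst = -1;
            toy_rename(ins, i + 1, n, gone, keep);
            continue;
        }

        /* Overwriting a temp invalidates anything cached in it. */
        if (p->dst >= 0)
            for (int r = 0; r < TOY_N_AREGS; r++)
                if (cached[r] == p->dst) cached[r] = -1;

        if (p->kind == TOY_GET) cached[p->areg] = p->dst;
        if (p->kind == TOY_PUT) cached[p->areg] = p->src1;
    }
}
]]></programlisting>
<para>Fed the first few uinstrs of the running example, this deletes
uinstrs 3 and 7 and renames <computeroutput>t2</computeroutput> and
<computeroutput>t6</computeroutput> to
<computeroutput>t0</computeroutput>, matching the log above.</para>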
				1775
				1776	</sect2>
				1777
				1778
				1779
				1780	<sect2 id="mc-tech-docs.instrum" xreflabel="UCode instrumentation">
				1781	<title>UCode instrumentation</title>
				1782
				1783	<para>Once you understand the meaning of the instrumentation
				1784	uinstrs, discussed in detail above, the instrumentation scheme is
				1785	fairly straightforward. Each uinstr is instrumented in
				1786	isolation, and the instrumentation uinstrs are placed before the
				1787	original uinstr. Our running example continues below. I have
				1788	placed a blank line after every original ucode, to make it easier
				1789	to see which instrumentation uinstrs correspond to which
				1790	originals.</para>
				1791
				1792	<para>As mentioned somewhere above,
				1793	<computeroutput>TempReg</computeroutput>s carrying values have
				1794	names like <computeroutput>t28</computeroutput>, and each one has
				1795	a shadow carrying its V bits, with names like
				1796	<computeroutput>q28</computeroutput>. This pairing aids in
				1797	reading instrumented ucode.</para>
				1798
				1799	<para>One decision about all this is where to have "observation
				1800	points", that is, where to check that V bits are valid. I use a
				1801	minimalistic scheme, only checking where a failure of validity
				1802	could cause the original program to (seg)fault. So the use of
				1803	values as memory addresses causes a check, as do conditional
				1804	jumps (these cause a check on the definedness of the condition
				1805	codes). And arguments <computeroutput>PUSH</computeroutput>ed
				1806	for helper calls are checked, hence the weird restrictions on
				1807	helper call preambles described above.</para>
				1808
				1809	<para>Another decision is that once a value is tested, it is
				1810	thereafter regarded as defined, so that we do not emit multiple
				1811	undefined-value errors for the same undefined value. That means
				1812	that <computeroutput>TESTV</computeroutput> uinstrs are always
				1813	followed by <computeroutput>SETV</computeroutput> on the same
				1814	(shadow) <computeroutput>TempReg</computeroutput>s. Most of
				1815	these <computeroutput>SETV</computeroutput>s are redundant and
				1816	are removed by the post-instrumentation cleanup phase.</para>
				1817
				1818	<para>The instrumentation for calling helper functions deserves
				1819	further comment. The definedness of results from a helper is
				1820	modelled using just one V bit. So, in short, we do pessimising
				1821	casts of the definedness of all the args, down to a single bit,
				1822	and then <computeroutput>UifU</computeroutput> these bits
				1823	together. So this single V bit will say "undefined" if any part
				1824	of any arg is undefined. This V bit is then pessimally cast back
				1825	up to the result(s) sizes, as needed. If, by seeing that all the
				1826	args are got rid of with <computeroutput>CLEAR</computeroutput>
				1827	and none with <computeroutput>POP</computeroutput>, Valgrind sees
				1828	that the result of the call is not actually used, it immediately
				1829	examines the result V bit with a
				1830	<computeroutput>TESTV</computeroutput> --
				1831	<computeroutput>SETV</computeroutput> pair. If it did not do
				1832	this, there would be no observation point to detect that some
				1833	of the args to the helper were undefined. Of course, if the
				1834	helper's results are indeed used, we don't do this, since the
				1835	result usage will presumably cause the result definedness to be
				1836	checked at some suitable future point.</para>
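<para>In C terms, the single-bit approximation for helper calls looks
roughly like this. The function name is invented and the sketch assumes
32-bit V-bit vectors for every argument and for the result; the real
instrumenter emits <computeroutput>TAG1</computeroutput> and
<computeroutput>TAG2</computeroutput> uinstrs rather than calling
anything like this.</para>
<programlisting><![CDATA[
#include <stdint.h>
#include <stddef.h>

/* V bits for the result of a helper call, given the V bits of each
   argument.  1 = undefined, 0 = defined. */
static uint32_t toy_helper_result_vbits(const uint32_t *arg_vbits,
                                        size_t n_args)
{
    uint32_t bit = 0;                          /* starts out defined */
    for (size_t i = 0; i < n_args; i++)
        bit |= (arg_vbits[i] != 0) ? 1u : 0u;  /* PCast to 1 bit, UifU */
    return bit ? 0xFFFFFFFFu : 0u;             /* PCast back up to 32 */
}
]]></programlisting>
<para>One undefined bit anywhere in any argument therefore marks the
whole result as undefined, which is exactly the loss of precision
discussed in the next paragraph.</para>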
				1837
				1838	<para>In general Valgrind tries to track definedness on a
				1839	bit-for-bit basis, but as the above para shows, for calls to
				1840	helpers we throw in the towel and approximate down to a single
				1841	bit. This is because it's too complex and difficult to track
				1842	bit-level definedness through complex ops such as integer
				1843	multiply and divide, and in any case there are no reasonable code
				1844	fragments which attempt to (eg) multiply two partially-defined
				1845	values and end up with something meaningful, so there seems
				1846	little point in modelling multiplies, divides, etc, in that level
				1847	of detail.</para>
				1848
				1849	<para>Integer loads and stores are instrumented with firstly a
				1850	test of the definedness of the address, followed by a
				1851	<computeroutput>LOADV</computeroutput> or
				1852	<computeroutput>STOREV</computeroutput> respectively. These turn
				1853	into calls to (for example)
				1854	<computeroutput>VG_(helperc_LOADV4)</computeroutput>. These
				1855	helpers do two things: they perform an address-valid check, and
				1856	they load or store V bits from/to the relevant address in the
				1857	(simulated V-bit) memory.</para>
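<para>The two jobs of these helpers can be seen in a toy model with a
tiny flat shadow memory. Everything here is hypothetical: the array
names, the one-byte-per-A-bit layout, the error counter, and the choice
to return "defined" after an addressing error are assumptions made for
the sketch, not a description of
<computeroutput>VG_(helperc_LOADV4)</computeroutput>.</para>
<programlisting><![CDATA[
#include <stdint.h>
#include <stddef.h>

#define TOY_MEM_SIZE 64

static uint8_t toy_abits[TOY_MEM_SIZE];  /* per byte: nonzero = invalid */
static uint8_t toy_vbits[TOY_MEM_SIZE];  /* per byte: 8 V bits          */
static int toy_addr_errors;              /* addressing errors reported  */

static int toy_range_ok(size_t addr, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (addr + i >= TOY_MEM_SIZE || toy_abits[addr + i])
            return 0;
    return 1;
}

/* Toy LOADV at size 4: address-validity check, then fetch V bits. */
static uint32_t toy_loadv4(size_t addr)
{
    uint32_t v = 0;
    if (!toy_range_ok(addr, 4)) {
        toy_addr_errors++;
        return 0;   /* assumption: call it defined, avoid cascades */
    }
    for (size_t i = 0; i < 4; i++)       /* little-endian, as on x86 */
        v |= (uint32_t)toy_vbits[addr + i] << (8 * i);
    return v;
}

/* Toy STOREV at size 4: address-validity check, then write V bits. */
static void toy_storev4(size_t addr, uint32_t v)
{
    if (!toy_range_ok(addr, 4)) {
        toy_addr_errors++;
        return;
    }
    for (size_t i = 0; i < 4; i++)
        toy_vbits[addr + i] = (uint8_t)(v >> (8 * i));
}
]]></programlisting>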
				1858
				1859	<para>FPU loads and stores are different. As above the
				1860	definedness of the address is first tested. However, the helper
				1861	routine for FPU loads
				1862	(<computeroutput>VGM_(fpu_read_check)</computeroutput>) emits an
				1863	error if either the address is invalid or the referenced area
				1864	contains undefined values. It has to do this because we do not
				1865	simulate the FPU at all, and so cannot track definedness of
				1866	values loaded into it from memory, so we have to check them as
				1867	soon as they are loaded into the FPU, ie, at this point. We
				1868	notionally assume that everything in the FPU is defined.</para>
				1869
				1870	<para>It follows therefore that FPU writes first check the
				1871	definedness of the address, then the validity of the address, and
				1872	finally mark the written bytes as well-defined.</para>
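<para>The difference between the integer and FPU cases can be
sketched with a toy shadow memory. Everything here is hypothetical
(the <computeroutput>toy_</computeroutput> names, the 64-byte
address space, and the choice to return "defined" after an address
error); it assumes a V bit of 1 means "undefined" and is only meant
to show the shape of the checks, not the real helpers.</para>
<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>

#define TOY_MEM_SIZE 64u                 /* hypothetical tiny address space */

static uint8_t toy_abits[TOY_MEM_SIZE];  /* 1 = byte is addressable         */
static uint8_t toy_vbits[TOY_MEM_SIZE];  /* per byte; a set bit = undefined */
static int toy_addr_errors;              /* invalid-address reports         */
static int toy_value_errors;             /* undefined-value reports         */

/* Is every byte in [addr, addr+len) addressable? */
static int toy_range_ok(uint32_t addr, uint32_t len)
{
    if (addr > TOY_MEM_SIZE || len > TOY_MEM_SIZE - addr)
        return 0;
    for (uint32_t i = 0; i < len; i++)
        if (!toy_abits[addr + i])
            return 0;
    return 1;
}

/* Integer load: check the address, then hand back the V bits so that
   definedness keeps flowing through the shadow registers. */
static uint32_t toy_loadv4(uint32_t addr)
{
    uint32_t v = 0;
    if (!toy_range_ok(addr, 4)) {
        toy_addr_errors++;
        return 0;   /* this sketch's choice: treat the result as defined */
    }
    for (uint32_t i = 0; i < 4; i++)
        v |= (uint32_t)toy_vbits[addr + i] << (8 * i);
    return v;
}

/* FPU load: there is no shadow FPU state, so undefined data has to
   be reported right here. */
static void toy_fpu_read_check(uint32_t addr, uint32_t len)
{
    if (!toy_range_ok(addr, len)) {
        toy_addr_errors++;
        return;
    }
    for (uint32_t i = 0; i < len; i++)
        if (toy_vbits[addr + i] != 0) {
            toy_value_errors++;
            return;
        }
}

/* FPU store: check the address, then mark the bytes as defined,
   since everything in the FPU is notionally defined. */
static void toy_fpu_write_check(uint32_t addr, uint32_t len)
{
    if (!toy_range_ok(addr, len)) {
        toy_addr_errors++;
        return;
    }
    for (uint32_t i = 0; i < len; i++)
        toy_vbits[addr + i] = 0;
}]]></programlisting>

<para>Note that the integer load of undefined bytes is silent and
merely returns their V bits, whereas the FPU read of the same bytes
is an immediate error.</para>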
				1873
				1874	<para>If anyone is inspired to extend Valgrind to MMX/SSE insns,
				1875	I suggest you use the same trick. It works provided that the
FPU/MMX unit is not used merely as a conduit to copy partially
				1877	undefined data from one place in memory to another.
				1878	Unfortunately the integer CPU is used like that (when copying C
				1879	structs with holes, for example) and this is the cause of much of
				1880	the elaborateness of the instrumentation here described.</para>
				1881
				1882	<para><computeroutput>vg_instrument()</computeroutput> in
				1883	<filename>vg_translate.c</filename> actually does the
				1884	instrumentation. There are comments explaining how each uinstr
				1885	is handled, so we do not repeat that here. As explained already,
				1886	it is bit-accurate, except for calls to helper functions.
				1887	Unfortunately the x86 insns
				1888	<computeroutput>bt/bts/btc/btr</computeroutput> are done by
				1889	helper fns, so bit-level accuracy is lost there. This should be
				1890	fixed by doing them inline; it will probably require adding a
couple of new uinstrs. Also, left and right rotates through the
				1892	carry flag (x86 <computeroutput>rcl</computeroutput> and
				1893	<computeroutput>rcr</computeroutput>) are approximated via a
				1894	single V bit; so far this has not caused anyone to complain. The
				1895	non-carry rotates, <computeroutput>rol</computeroutput> and
				1896	<computeroutput>ror</computeroutput>, are much more common and
				1897	are done exactly. Re-visiting the instrumentation for AND and
				1898	OR, they seem rather verbose, and I wonder if it could be done
				1899	more concisely now.</para>
				1900
				1901	<para>The lowercase <computeroutput>o</computeroutput> on many of
				1902	the uopcodes in the running example indicates that the size field
				1903	is zero, usually meaning a single-bit operation.</para>
				1904
				1905	<para>Anyroads, the post-instrumented version of our running
				1906	example looks like this:</para>
				1907
				1908	<programlisting><![CDATA[
				1909	Instrumented code:
				1910	0: GETVL %EDX, q0
				1911	1: GETL %EDX, t0
				1912
				1913	2: TAG1o q0 = Left4 ( q0 )
				1914	3: INCL t0
				1915
				1916	4: PUTVL q0, %EDX
				1917	5: PUTL t0, %EDX
				1918
				1919	6: TESTVL q0
				1920	7: SETVL q0
				1921	8: LOADVB (t0), q0
				1922	9: LDB (t0), t0
				1923
				1924	10: TAG1o q0 = SWiden14 ( q0 )
				1925	11: WIDENL_Bs t0
				1926
				1927	12: PUTVL q0, %EAX
				1928	13: PUTL t0, %EAX
				1929
				1930	14: GETVL %ECX, q8
				1931	15: GETL %ECX, t8
				1932
				1933	16: MOVL q0, q4
				1934	17: SHLL $0x1, q4
				1935	18: TAG2o q4 = UifU4 ( q8, q4 )
				1936	19: TAG1o q4 = Left4 ( q4 )
				1937	20: LEA2L 1(t8,t0,2), t4
				1938
				1939	21: TESTVL q4
				1940	22: SETVL q4
				1941	23: LOADVB (t4), q10
				1942	24: LDB (t4), t10
				1943
				1944	25: SETVB q12
				1945	26: MOVB $0x20, t12
				1946
				1947	27: MOVL q10, q14
				1948	28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 )
				1949	29: TAG2o q10 = UifU1 ( q12, q10 )
				1950	30: TAG2o q10 = DifD1 ( q14, q10 )
				1951	31: MOVL q12, q14
				1952	32: TAG2o q14 = ImproveAND1_TQ ( t12, q14 )
				1953	33: TAG2o q10 = DifD1 ( q14, q10 )
				1954	34: MOVL q10, q16
				1955	35: TAG1o q16 = PCast10 ( q16 )
				1956	36: PUTVFo q16
				1957	37: ANDB t12, t10 (-wOSZACP)
				1958
				1959	38: INCEIPo $9
				1960
				1961	39: GETVFo q18
				1962	40: TESTVo q18
				1963	41: SETVo q18
				1964	42: Jnzo $0x40435A50 (-rOSZACP)
				1965
				1966	43: JMPo $0x40435A5B]]></programlisting>
				1967
				1968	</sect2>
				1969
				1970
				1971
				1972	<sect2 id="mc-tech-docs.cleanup"
				1973	xreflabel="UCode post-instrumentation cleanup">
				1974	<title>UCode post-instrumentation cleanup</title>
				1975
				1976	<para>This pass, coordinated by
				1977	<computeroutput>vg_cleanup()</computeroutput>, removes redundant
				1978	definedness computation created by the simplistic instrumentation
				1979	pass. It consists of two passes,
				1980	<computeroutput>vg_propagate_definedness()</computeroutput>
				1981	followed by
				1982	<computeroutput>vg_delete_redundant_SETVs</computeroutput>.</para>
				1983
				1984	<para><computeroutput>vg_propagate_definedness()</computeroutput>
				1985	is a simple constant-propagation and constant-folding pass. It
				1986	tries to determine which
				1987	<computeroutput>TempReg</computeroutput>s containing V bits will
				1988	always indicate "fully defined", and it propagates this
				1989	information as far as it can, and folds out as many operations as
				1990	possible. For example, the instrumentation for an ADD of a
				1991	literal to a variable quantity will be reduced down so that the
				1992	definedness of the result is simply the definedness of the
				1993	variable quantity, since the literal is by definition fully
				1994	defined.</para>
				1995
				1996	<para><computeroutput>vg_delete_redundant_SETVs</computeroutput>
				1997	removes <computeroutput>SETV</computeroutput>s on shadow
				1998	<computeroutput>TempReg</computeroutput>s for which the next
				1999	action is a write. I don't think there's anything else worth
				2000	saying about this; it is simple. Read the sources for
				2001	details.</para>
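<para>For readers who would rather not dig through the sources, the
rule "delete a SETV if the next action on that TempReg is a write"
can be sketched as follows. The
<computeroutput>ToyUInstr</computeroutput> structure is invented for
this illustration and models only that one rule; the real pass works
on real uinstrs and may handle further cases.</para>
<programlisting><![CDATA[
#include <assert.h>

/* A toy uinstr: which shadow temp it reads and writes (-1 = none). */
typedef struct {
    int is_setv;   /* 1 if this is a SETV on `writes` */
    int reads;     /* shadow temp read, or -1         */
    int writes;    /* shadow temp written, or -1      */
    int deleted;   /* set to 1 when the pass removes it */
} ToyUInstr;

/* Mark as deleted every SETV whose temp is overwritten before it is
   next read. Returns the number of SETVs deleted. */
static int toy_delete_redundant_setvs(ToyUInstr *code, int n)
{
    int ndeleted = 0;
    for (int i = 0; i < n; i++) {
        if (!code[i].is_setv || code[i].deleted)
            continue;
        int q = code[i].writes;
        for (int j = i + 1; j < n; j++) {
            if (code[j].deleted)
                continue;
            if (code[j].reads == q)
                break;                 /* value is used: keep the SETV */
            if (code[j].writes == q) { /* overwritten first: redundant */
                code[i].deleted = 1;
                ndeleted++;
                break;
            }
        }
    }
    return ndeleted;
}]]></programlisting>

<para>Applied to a sequence shaped like uinstrs 6 to 8 of the
running example (a test of <computeroutput>q0</computeroutput>, a
<computeroutput>SETV</computeroutput> of
<computeroutput>q0</computeroutput>, then a
<computeroutput>LOADV</computeroutput> into
<computeroutput>q0</computeroutput>), this sketch deletes the
<computeroutput>SETV</computeroutput>.</para>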
				2002
				2003	<para>So the cleaned-up running example looks like this. As
				2004	above, I have inserted line breaks after every original
				2005	(non-instrumentation) uinstr to aid readability. As with
				2006	straightforward ucode optimisation, the results in this block are
				2007	undramatic because it is so short; longer blocks benefit more
				2008	because they have more redundancy which gets eliminated.</para>
				2009
				2010	<programlisting><![CDATA[
				2011	at 29: delete UifU1 due to defd arg1
				2012	at 32: change ImproveAND1_TQ to MOV due to defd arg2
				2013	at 41: delete SETV
				2014	at 31: delete MOV
				2015	at 25: delete SETV
				2016	at 22: delete SETV
				2017	at 7: delete SETV
				2018
				2019	0: GETVL %EDX, q0
				2020	1: GETL %EDX, t0
				2021
				2022	2: TAG1o q0 = Left4 ( q0 )
				2023	3: INCL t0
				2024
				2025	4: PUTVL q0, %EDX
				2026	5: PUTL t0, %EDX
				2027
				2028	6: TESTVL q0
				2029	8: LOADVB (t0), q0
				2030	9: LDB (t0), t0
				2031
				2032	10: TAG1o q0 = SWiden14 ( q0 )
				2033	11: WIDENL_Bs t0
				2034
				2035	12: PUTVL q0, %EAX
				2036	13: PUTL t0, %EAX
				2037
				2038	14: GETVL %ECX, q8
				2039	15: GETL %ECX, t8
				2040
				2041	16: MOVL q0, q4
				2042	17: SHLL $0x1, q4
				2043	18: TAG2o q4 = UifU4 ( q8, q4 )
				2044	19: TAG1o q4 = Left4 ( q4 )
				2045	20: LEA2L 1(t8,t0,2), t4
				2046
				2047	21: TESTVL q4
				2048	23: LOADVB (t4), q10
				2049	24: LDB (t4), t10
				2050
				2051	26: MOVB $0x20, t12
				2052
				2053	27: MOVL q10, q14
				2054	28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 )
				2055	30: TAG2o q10 = DifD1 ( q14, q10 )
				2056	32: MOVL t12, q14
				2057	33: TAG2o q10 = DifD1 ( q14, q10 )
				2058	34: MOVL q10, q16
				2059	35: TAG1o q16 = PCast10 ( q16 )
				2060	36: PUTVFo q16
				2061	37: ANDB t12, t10 (-wOSZACP)
				2062
				2063	38: INCEIPo $9
				2064	39: GETVFo q18
				2065	40: TESTVo q18
				2066	42: Jnzo $0x40435A50 (-rOSZACP)
				2067
				2068	43: JMPo $0x40435A5B]]></programlisting>
				2069
				2070	</sect2>
				2071
				2072
				2073
				2074	<sect2 id="mc-tech-docs.transfrom" xreflabel="Translation from UCode">
				2075	<title>Translation from UCode</title>
				2076
				2077	<para>This is all very simple, even though
				2078	<filename>vg_from_ucode.c</filename> is a big file.
				2079	Position-independent x86 code is generated into a dynamically
				2080	allocated array <computeroutput>emitted_code</computeroutput>;
				2081	this is doubled in size when it overflows. Eventually the array
				2082	is handed back to the caller of
				2083	<computeroutput>VG_(translate)</computeroutput>, who must copy
				2084	the result into TC and TT, and free the array.</para>
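<para>The doubling strategy for the output array is the usual one.
A minimal sketch, using libc's
<computeroutput>realloc</computeroutput> for brevity (the real code
cannot assume it may use libc) and with invented names and an
arbitrary initial size of 16 bytes:</para>
<programlisting><![CDATA[
#include <assert.h>
#include <stdlib.h>

typedef struct {
    unsigned char *bytes;  /* the emitted code            */
    size_t used;           /* bytes emitted so far        */
    size_t size;           /* current capacity of `bytes` */
} ToyCodeBuf;

/* Append one byte, doubling the buffer when it is full.
   Returns 0 on success, -1 if memory could not be obtained. */
static int toy_emit_byte(ToyCodeBuf *buf, unsigned char b)
{
    if (buf->used == buf->size) {
        size_t new_size = buf->size ? buf->size * 2 : 16;
        unsigned char *p = realloc(buf->bytes, new_size);
        if (p == NULL)
            return -1;
        buf->bytes = p;
        buf->size = new_size;
    }
    buf->bytes[buf->used++] = b;
    return 0;
}]]></programlisting>

<para>Because the capacity doubles each time, the total copying cost
stays proportional to the final size of the translation.</para>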
				2085
				2086	<para>This file is structured into four layers of abstraction,
				2087	which, thankfully, are glued back together with extensive
				2088	<computeroutput>__inline__</computeroutput> directives. From the
				2089	bottom upwards:</para>
				2090
				2091	<itemizedlist>
				2092
				2093	<listitem>
				2094	<para>Address-mode emitters,
				2095	<computeroutput>emit_amode_regmem_reg</computeroutput> et
				2096	al.</para>
				2097	</listitem>
				2098
				2099	<listitem>
				2100	<para>Emitters for specific x86 instructions. There are
				2101	quite a lot of these, with names such as
				2102	<computeroutput>emit_movv_offregmem_reg</computeroutput>.
				2103	The <computeroutput>v</computeroutput> suffix is Intel
				2104	parlance for a 16/32 bit insn; there are also
				2105	<computeroutput>b</computeroutput> suffixes for 8 bit
				2106	insns.</para>
				2107	</listitem>
				2108
				2109	<listitem>
				2110	<para>The next level up are the
				2111	<computeroutput>synth_*</computeroutput> functions, which
				2112	synthesise possibly a sequence of raw x86 instructions to do
				2113	some simple task. Some of these are quite complex because
				2114	they have to work around Intel's silly restrictions on
				2115	subregister naming. See
				2116	<computeroutput>synth_nonshiftop_reg_reg</computeroutput> for
				2117	example.</para>
				2118	</listitem>
				2119
				2120	<listitem>
				2121	<para>Finally, at the top of the heap, we have
				2122	<computeroutput>emitUInstr()</computeroutput>, which emits
				2123	code for a single uinstr.</para>
				2124	</listitem>
				2125
				2126	</itemizedlist>
				2127
				2128	<para>Some comments:</para>
				2129
				2130	<itemizedlist>
				2131
				2132	<listitem>
				2133	<para>The hack for FPU instructions becomes apparent here.
				2134	To do a <computeroutput>FPU</computeroutput> ucode
instruction, we load the simulated FPU's state from
<computeroutput>VG_(baseBlock)</computeroutput> into the real
				2137	FPU using an x86 <computeroutput>frstor</computeroutput>
				2138	insn, do the ucode <computeroutput>FPU</computeroutput> insn
				2139	on the real CPU, and write the updated FPU state back into
				2140	<computeroutput>VG_(baseBlock)</computeroutput> using an
				2141	<computeroutput>fnsave</computeroutput> instruction. This is
				2142	pretty brutal, but is simple and it works, and even seems
				2143	tolerably efficient. There is no attempt to cache the
				2144	simulated FPU state in the real FPU over multiple
				2145	back-to-back ucode FPU instructions.</para>
				2146
				2147	<para><computeroutput>FPU_R</computeroutput> and
				2148	<computeroutput>FPU_W</computeroutput> are also done this
				2149	way, with the minor complication that we need to patch in
				2150	some addressing mode bits so the resulting insn knows the
				2151	effective address to use. This is easy because of the
				2152	regularity of the x86 FPU instruction encodings.</para>
				2153	</listitem>
				2154
				2155	<listitem>
				2156	<para>An analogous trick is done with ucode insns which
				2157	claim, in their <computeroutput>flags_r</computeroutput> and
				2158	<computeroutput>flags_w</computeroutput> fields, that they
				2159	read or write the simulated
				2160	<computeroutput>%EFLAGS</computeroutput>. For such cases we
				2161	first copy the simulated
				2162	<computeroutput>%EFLAGS</computeroutput> into the real
				2163	<computeroutput>%eflags</computeroutput>, then do the insn,
				2164	then, if the insn says it writes the flags, copy back to
				2165	<computeroutput>%EFLAGS</computeroutput>. This is a bit
				2166	expensive, which is why the ucode optimisation pass goes to
				2167	some effort to remove redundant flag-update annotations.</para>
				2168	</listitem>
				2169
				2170	</itemizedlist>
				2171
				2172	<para>And so ... that's the end of the documentation for the
instrumenting translator! It's really not that complex,
				2174	because it's composed as a sequence of simple(ish) self-contained
				2175	transformations on straight-line blocks of code.</para>
				2176
				2177	</sect2>
				2178
				2179
				2180
				2181	<sect2 id="mc-tech-docs.dispatch" xreflabel="Top-level dispatch loop">
				2182	<title>Top-level dispatch loop</title>
				2183
				2184	<para>Urk. In <computeroutput>VG_(toploop)</computeroutput>.
				2185	This is basically boring and unsurprising, not to mention fiddly
				2186	and fragile. It needs to be cleaned up.</para>
				2187
<para>Perhaps the only surprise is that the whole thing is run on
				2189	top of a <computeroutput>setjmp</computeroutput>-installed
				2190	exception handler, because, supposing a translation got a
				2191	segfault, we have to bail out of the Valgrind-supplied exception
				2192	handler <computeroutput>VG_(oursignalhandler)</computeroutput>
				2193	and immediately start running the client's segfault handler, if
				2194	it has one. In particular we can't finish the current basic
				2195	block and then deliver the signal at some convenient future
				2196	point, because signals like SIGILL, SIGSEGV and SIGBUS mean that
				2197	the faulting insn should not simply be re-tried. (I'm sure there
				2198	is a clearer way to explain this).</para>
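<para>The general pattern of bailing out of a signal handler back to
the top-level loop can be sketched with POSIX
<computeroutput>sigsetjmp</computeroutput> and
<computeroutput>siglongjmp</computeroutput>. This is an illustration
only: the <computeroutput>toy_</computeroutput> names are invented,
and Valgrind's own mechanism may well differ in detail.</para>
<programlisting><![CDATA[
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>

static sigjmp_buf toy_toploop_env;
static volatile sig_atomic_t toy_last_signal;

/* Handler: record the signal and jump straight back to the top-level
   loop, abandoning the translation that faulted. */
static void toy_signal_handler(int signo)
{
    toy_last_signal = signo;
    siglongjmp(toy_toploop_env, 1);
}

/* Run one translation under the handler. Returns 0 if it ran to
   completion, or the number of the signal that interrupted it. */
static int toy_run_translation(void (*translation)(void))
{
    struct sigaction sa, old;

    sa.sa_handler = toy_signal_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGSEGV, &sa, &old);

    toy_last_signal = 0;
    if (sigsetjmp(toy_toploop_env, 1) == 0)
        translation();       /* may never return normally */

    sigaction(SIGSEGV, &old, NULL);
    return toy_last_signal;
}

/* Two stand-in "translations" for demonstration. */
static void toy_faulting_translation(void) { raise(SIGSEGV); }
static void toy_ok_translation(void) { }]]></programlisting>

<para>Calling
<computeroutput>toy_run_translation(toy_faulting_translation)</computeroutput>
returns <computeroutput>SIGSEGV</computeroutput> without the rest of
the translation being run, at which point the caller is free to start
the client's own handler.</para>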
				2199
				2200	</sect2>
				2201
				2202
				2203
				2204	<sect2 id="mc-tech-docs.lazy"
				2205	xreflabel="Lazy updates of the simulated program counter">
				2206	<title>Lazy updates of the simulated program counter</title>
				2207
				2208	<para>Simulated <computeroutput>%EIP</computeroutput> is not
				2209	updated after every simulated x86 insn as this was regarded as
				2210	too expensive. Instead ucode
				2211	<computeroutput>INCEIP</computeroutput> insns move it along as
				2212	and when necessary. Currently we don't allow it to fall more
				2213	than 4 bytes behind reality (see
				2214	<computeroutput>VG_(disBB)</computeroutput> for the way this
				2215	works).</para>
				2216
				2217	<para>Note that <computeroutput>%EIP</computeroutput> is always
				2218	brought up to date by the inner dispatch loop in
				2219	<computeroutput>VG_(dispatch)</computeroutput>, so that if the
				2220	client takes a fault we know at least which basic block this
				2221	happened in.</para>
				2222
				2223	</sect2>
				2224
				2225
				2226
				2227	<sect2 id="mc-tech-docs.signals" xreflabel="Signals">
				2228	<title>Signals</title>
				2229
				2230	<para>Horrible, horrible. <filename>vg_signals.c</filename>.
				2231	Basically, since we have to intercept all system calls anyway, we
				2232	can see when the client tries to install a signal handler. If it
				2233	does so, we make a note of what the client asked to happen, and
				2234	ask the kernel to route the signal to our own signal handler,
				2235	<computeroutput>VG_(oursignalhandler)</computeroutput>. This
				2236	simply notes the delivery of signals, and returns.</para>
				2237
				2238	<para>Every 1000 basic blocks, we see if more signals have
				2239	arrived. If so,
				2240	<computeroutput>VG_(deliver_signals)</computeroutput> builds
				2241	signal delivery frames on the client's stack, and allows their
				2242	handlers to be run. Valgrind places in these signal delivery
				2243	frames a bogus return address,
				2244	<computeroutput>VG_(signalreturn_bogusRA)</computeroutput>, and
checks all jumps to see if any jump to it. If so, this is a sign
that a signal handler is returning, in which case Valgrind removes
the relevant signal frame from the client's stack, restores from
the signal frame the simulated state as it was before the signal was
delivered, and allows the client to run onwards. We have to do
				2250	it this way because some signal handlers never return, they just
				2251	<computeroutput>longjmp()</computeroutput>, which nukes the
				2252	signal delivery frame.</para>
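<para>The note-now, deliver-later part of this scheme can be sketched
as below. The names are invented stand-ins for
<computeroutput>VG_(oursignalhandler)</computeroutput> and
<computeroutput>VG_(deliver_signals)</computeroutput>, "delivery" is
reduced to bumping a counter, and the bogus-return-address machinery
is not modelled at all.</para>
<programlisting><![CDATA[
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>

#define TOY_NSIG          32     /* signals tracked by this sketch   */
#define TOY_POLL_INTERVAL 1000   /* basic blocks between polls       */

static volatile sig_atomic_t toy_pending[TOY_NSIG];
static int toy_delivered[TOY_NSIG];

/* Stand-in for the real handler: just note the signal and return. */
static void toy_note_signal(int signo)
{
    if (signo > 0 && signo < TOY_NSIG)
        toy_pending[signo] = 1;
}

/* Route `signo` to the noting handler. Returns 0 on success. */
static int toy_install(int signo)
{
    struct sigaction sa;
    assert(signo > 0 && signo < TOY_NSIG);
    sa.sa_handler = toy_note_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signo, &sa, NULL);
}

/* Stand-in for delivery: here we only count what would be delivered. */
static void toy_deliver_signals(void)
{
    for (int s = 1; s < TOY_NSIG; s++)
        if (toy_pending[s]) {
            toy_pending[s] = 0;
            toy_delivered[s]++;
        }
}

/* Run `nblocks` basic blocks, polling for signals periodically. */
static void toy_run_blocks(long nblocks, void (*run_block)(long))
{
    for (long i = 1; i <= nblocks; i++) {
        run_block(i);
        if (i % TOY_POLL_INTERVAL == 0)
            toy_deliver_signals();
    }
}

/* Example block body: raises SIGUSR1 during block 5. */
static void toy_example_block(long i)
{
    if (i == 5)
        raise(SIGUSR1);
}]]></programlisting>

<para>A signal raised in block 5 therefore sits in
<computeroutput>toy_pending</computeroutput> until block 1000
completes, which is the latency the polling interval buys us.</para>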
				2253
				2254	<para>The Linux kernel has a different but equally horrible hack
				2255	for detecting signal handler returns. Discovering it is left as
				2256	an exercise for the reader.</para>
				2257
				2258	</sect2>
				2259
				2260
				2261	<sect2 id="mc-tech-docs.todo">
				2262	<title>To be written</title>
				2263
				2264	<para>The following is a list of as-yet-not-written stuff. Apologies.</para>
				2265	<orderedlist>
				2266	<listitem>
				2267	<para>The translation cache and translation table</para>
				2268	</listitem>
				2269	<listitem>
				2270	<para>Exceptions, creating new translations</para>
				2271	</listitem>
				2272	<listitem>
				2273	<para>Self-modifying code</para>
				2274	</listitem>
				2275	<listitem>
				2276	<para>Errors, error contexts, error reporting, suppressions</para>
				2277	</listitem>
				2278	<listitem>
				2279	<para>Client malloc/free</para>
				2280	</listitem>
				2281	<listitem>
				2282	<para>Low-level memory management</para>
				2283	</listitem>
				2284	<listitem>
				2285	<para>A and V bitmaps</para>
				2286	</listitem>
				2287	<listitem>
				2288	<para>Symbol table management</para>
				2289	</listitem>
				2290	<listitem>
				2291	<para>Dealing with system calls</para>
				2292	</listitem>
				2293	<listitem>
				2294	<para>Namespace management</para>
				2295	</listitem>
				2296	<listitem>
				2297	<para>GDB attaching</para>
				2298	</listitem>
				2299	<listitem>
				2300	<para>Non-dependence on glibc or anything else</para>
				2301	</listitem>
				2302	<listitem>
				2303	<para>The leak detector</para>
				2304	</listitem>
				2305	<listitem>
				2306	<para>Performance problems</para>
				2307	</listitem>
				2308	<listitem>
				2309	<para>Continuous sanity checking</para>
				2310	</listitem>
				2311	<listitem>
				2312	<para>Tracing, or not tracing, child processes</para>
				2313	</listitem>
				2314	<listitem>
				2315	<para>Assembly glue for syscalls</para>
				2316	</listitem>
				2317	</orderedlist>
				2318
				2319	</sect2>
				2320
				2321	</sect1>
				2322
				2323
				2324
				2325
				2326	<sect1 id="mc-tech-docs.extensions" xreflabel="Extensions">
				2327	<title>Extensions</title>
				2328
				2329	<para>Some comments about Stuff To Do.</para>
				2330
				2331	<sect2 id="mc-tech-docs.bugs" xreflabel="Bugs">
				2332	<title>Bugs</title>
				2333
				2334	<para>Stephan Kulow and Marc Mutz report problems with kmail in
				2335	KDE 3 CVS (RC2 ish) when run on Valgrind. Stephan has it
				2336	deadlocking; Marc has it looping at startup. I can't repro
				2337	either behaviour. Needs repro-ing and fixing.</para>
				2338
				2339	</sect2>
				2340
				2341
				2342	<sect2 id="mc-tech-docs.threads" xreflabel="Threads">
				2343	<title>Threads</title>
				2344
				2345	<para>Doing a good job of thread support strikes me as almost a
				2346	research-level problem. The central issues are how to do fast
				2347	cheap locking of the
				2348	<computeroutput>VG_(primary_map)</computeroutput> structure,
				2349	whether or not accesses to the individual secondary maps need
				2350	locking, what race-condition issues result, and whether the
				2351	already-nasty mess that is the signal simulator needs further
				2352	hackery.</para>
				2353
				2354	<para>I realise that threads are the most-frequently-requested
				2355	feature, and I am thinking about it all. If you have guru-level
				2356	understanding of fast mutual exclusion mechanisms and race
				2357	conditions, I would be interested in hearing from you.</para>
				2358
				2359	</sect2>
				2360
				2361
				2362
				2363	<sect2 id="mc-tech-docs.verify" xreflabel="Verification suite">
				2364	<title>Verification suite</title>
				2365
				2366	<para>Directory <computeroutput>tests/</computeroutput> contains
				2367	various ad-hoc tests for Valgrind. However, there is no
				2368	systematic verification or regression suite, that, for example,
				2369	exercises all the stuff in <filename>vg_memory.c</filename>, to
				2370	ensure that illegal memory accesses and undefined value uses are
				2371	detected as they should be. It would be good to have such a
				2372	suite.</para>
				2373
				2374	</sect2>
				2375
				2376
				2377	<sect2 id="mc-tech-docs.porting" xreflabel="Porting to other platforms">
				2378	<title>Porting to other platforms</title>
				2379
				2380	<para>It would be great if Valgrind was ported to FreeBSD and x86
				2381	NetBSD, and to x86 OpenBSD, if it's possible (doesn't OpenBSD use
a.out-style executables, not ELF?).</para>
				2383
				2384	<para>The main difficulties, for an x86-ELF platform, seem to
				2385	be:</para>
				2386
				2387	<itemizedlist>
				2388
				2389	<listitem>
				2390	<para>You'd need to rewrite the
				2391	<computeroutput>/proc/self/maps</computeroutput> parser
				2392	(<filename>vg_procselfmaps.c</filename>). Easy.</para>
				2393	</listitem>
				2394
				2395	<listitem>
				2396	<para>You'd need to rewrite
				2397	<filename>vg_syscall_mem.c</filename>, or, more specifically,
				2398	provide one for your OS. This is tedious, but you can
				2399	implement syscalls on demand, and the Linux kernel interface
				2400	is, for the most part, going to look very similar to the *BSD
				2401	interfaces, so it's really a copy-paste-and-modify-on-demand
				2402	job. As part of this, you'd need to supply a new
				2403	<filename>vg_kerneliface.h</filename> file.</para>
				2404	</listitem>
				2405
				2406	<listitem>
				2407	<para>You'd also need to change the syscall wrappers for
				2408	Valgrind's internal use, in
				2409	<filename>vg_mylibc.c</filename>.</para>
				2410	</listitem>
				2411
				2412	</itemizedlist>
				2413
				2414	<para>All in all, I think a port to x86-ELF *BSDs is not really
				2415	very difficult, and in some ways I would like to see it happen,
because that would force a clearer factoring of Valgrind into
				2417	platform dependent and independent pieces. Not to mention, *BSD
				2418	folks also deserve to use Valgrind just as much as the Linux crew
				2419	do.</para>
				2420
				2421	</sect2>
				2422
				2423	</sect1>
				2424
				2425
				2426
				2427	<sect1 id="mc-tech-docs.easystuff"
				2428	xreflabel="Easy stuff which ought to be done">
				2429	<title>Easy stuff which ought to be done</title>
				2430
				2431
				2432	<sect2 id="mc-tech-docs.mmx" xreflabel="MMX Instructions">
				2433	<title>MMX Instructions</title>
				2434
				2435	<para>MMX insns should be supported, using the same trick as for
				2436	FPU insns. If the MMX registers are not used to copy
				2437	uninitialised junk from one place to another in memory, this
				2438	means we don't have to actually simulate the internal MMX unit
				2439	state, so the FPU hack applies. This should be fairly
				2440	easy.</para>
				2441
				2442	</sect2>
				2443
				2444
				2445	<sect2 id="mc-tech-docs.fixstabs" xreflabel="Fix stabs-info Reader">
				2446	<title>Fix stabs-info reader</title>
				2447
				2448	<para>The machinery in <filename>vg_symtab2.c</filename> which
				2449	reads "stabs" style debugging info is pretty weak. It usually
				2450	correctly translates simulated program counter values into line
				2451	numbers and procedure names, but the file name is often
				2452	completely wrong. I think the logic used to parse "stabs"
				2453	entries is weak. It should be fixed. The simplest solution,
				2454	IMO, is to copy either the logic or simply the code out of GNU
				2455	binutils which does this; since GDB can clearly get it right,
				2456	binutils (or GDB?) must have code to do this somewhere.</para>
				2457
				2458	</sect2>
				2459
				2460
				2461
				2462	<sect2 id="mc-tech-docs.x86instr" xreflabel="BT/BTC/BTS/BTR">
				2463	<title>BT/BTC/BTS/BTR</title>
				2464
				2465	<para>These are x86 instructions which test, complement, set, or
				2466	reset, a single bit in a word. At the moment they are both
				2467	incorrectly implemented and incorrectly instrumented.</para>
				2468
				2469	<para>The incorrect instrumentation is due to use of helper
				2470	functions. This means we lose bit-level definedness tracking,
				2471	which could wind up giving spurious uninitialised-value use
				2472	errors. The Right Thing to do is to invent a couple of new
				2473	UOpcodes, I think <computeroutput>GET_BIT</computeroutput> and
				2474	<computeroutput>SET_BIT</computeroutput>, which can be used to
				2475	implement all 4 x86 insns, get rid of the helpers, and give
				2476	bit-accurate instrumentation rules for the two new
				2477	UOpcodes.</para>
				2478
				2479	<para>I realised the other day that they are mis-implemented too.
				2480	The x86 insns take a bit-index and a register or memory location
				2481	to access. For registers the bit index clearly can only be in
				2482	the range zero to register-width minus 1, and I assumed the same
				2483	applied to memory locations too. But evidently not; for memory
				2484	locations the index can be arbitrary, and the processor will
				2485	index arbitrarily into memory as a result. This too should be
				2486	fixed. Sigh. Presumably indexing outside the immediate word is
				2487	not actually used by any programs yet tested on Valgrind, for
				2488	otherwise they (presumably) would simply not work at all. If you
				2489	plan to hack on this, first check the Intel docs to make sure my
				2490	understanding is really correct.</para>
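<para>To illustrate what "index arbitrarily into memory" means, here
is a sketch of how a signed bit index relative to a base address
selects a byte and a bit within it. It assumes the index comes from a
register and is treated as a signed 32-bit quantity; the function
names are invented, and the Intel documentation remains the authority
on the exact access size and alignment used by the hardware.</para>
<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>

/* Split a signed bit index into a byte offset (rounded towards
   minus infinity) and a bit number 0..7 within that byte. */
static void toy_bt_locate(int32_t bit_index,
                          int32_t *byte_off, unsigned *bit_in_byte)
{
    int32_t q = bit_index / 8;
    int32_t r = bit_index % 8;
    if (r < 0) {          /* C division truncates towards zero */
        r += 8;
        q -= 1;
    }
    *byte_off = q;
    *bit_in_byte = (unsigned)r;
}

/* Model of BT on a memory operand: return the selected bit. */
static int toy_bt_mem(const uint8_t *base, int32_t bit_index)
{
    int32_t off;
    unsigned bit;
    toy_bt_locate(bit_index, &off, &bit);
    return (base[off] >> bit) & 1;
}]]></programlisting>

<para>With this model, bit index 33 touches the byte at
<computeroutput>base+4</computeroutput> and index -1 touches the top
bit of the byte at <computeroutput>base-1</computeroutput>, both of
which lie outside the word at
<computeroutput>base</computeroutput>.</para>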
				2491
				2492	</sect2>
				2493
				2494
				2495	<sect2 id="mc-tech-docs.prefetch" xreflabel="Using PREFETCH Instructions">
				2496	<title>Using PREFETCH Instructions</title>
				2497
				2498	<para>Here's a small but potentially interesting project for
				2499	performance junkies. Experiments with valgrind's code generator
				2500	and optimiser(s) suggest that reducing the number of instructions
				2501	executed in the translations and mem-check helpers gives
<computeroutput>VG_(primary_map)</computeroutput>, one for the
				2503	because performance of Valgrindified code is limited by cache
				2504	misses. After all, each read in the original program now gives
				2505	rise to at least three reads, one for the
				2506	<computeroutput>VG_(primary_map)</computeroutput>, one of the
				2507	resulting secondary, and the original. Not to mention, the
				2508	instrumented translations are 13 to 14 times larger than the
				2509	originals. All in all one would expect the memory system to be
				2510	hammered to hell and then some.</para>
				2511
				2512	<para>So here's an idea. An x86 insn involving a read from
				2513	memory, after instrumentation, will turn into ucode of the
				2514	following form:</para>
				2515	<programlisting><![CDATA[
				2516	... calculate effective addr, into ta and qa ...
				2517	TESTVL qa -- is the addr defined?
				2518	LOADV (ta), qloaded -- fetch V bits for the addr
				2519	LOAD (ta), tloaded -- do the original load]]></programlisting>
				2520
				2521	<para>At the point where the
				2522	<computeroutput>LOADV</computeroutput> is done, we know the
				2523	actual address (<computeroutput>ta</computeroutput>) from which
				2524	the real <computeroutput>LOAD</computeroutput> will be done. We
				2525	also know that the <computeroutput>LOADV</computeroutput> will
				2526	take around 20 x86 insns to do. So it seems plausible that doing
				2527	a prefetch of <computeroutput>ta</computeroutput> just before the
				2528	<computeroutput>LOADV</computeroutput> might just avoid a miss at
				2529	the <computeroutput>LOAD</computeroutput> point, and that might
				2530	be a significant performance win.</para>
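<para>At the C level the idea looks roughly like this. The sketch
uses the GCC/Clang builtin
<computeroutput>__builtin_prefetch</computeroutput> rather than
emitting a specific x86 prefetch insn, a flat shadow array stands in
for the real two-level map, and the names are invented. Whether it
actually helps is exactly the open question.</para>
<programlisting><![CDATA[
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint32_t *data;    /* the "real" memory                    */
    const uint32_t *vbits;   /* shadow V bits, one word per data word */
} ToyMem;

/* Load data[idx] and its V bits, issuing a prefetch for the data
   before the (notionally slow) shadow lookup. */
static uint32_t toy_prefetching_load(const ToyMem *m, size_t idx,
                                     uint32_t *vbits_out)
{
    const uint32_t *ta = &m->data[idx];
    __builtin_prefetch(ta, 0, 3);   /* hint only: read, keep in cache */
    *vbits_out = m->vbits[idx];     /* stands in for the LOADV helper */
    return *ta;                     /* the original LOAD              */
}]]></programlisting>

<para>The prefetch is only a hint, so the result of the function is
the same with or without it; only the timing can change.</para>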
				2531
<para>Prefetch insns are notoriously temperamental, more often
				2533	than not making things worse rather than better, so this would
				2534	require considerable fiddling around. It's complicated because
				2535	Intels and AMDs have different prefetch insns with different
				2536	semantics, so that too needs to be taken into account. As a
				2537	general rule, even placing the prefetches before the
				2538	<computeroutput>LOADV</computeroutput> insn is too near the
				2539	<computeroutput>LOAD</computeroutput>; the ideal distance is
				2540	apparently circa 200 CPU cycles. So it might be worth having
				2541	another analysis/transformation pass which pushes prefetches as
				2542	far back as possible, hopefully immediately after the effective
				2543	address becomes available.</para>
				2544
				2545	<para>Doing too many prefetches is also bad because they soak up
				2546	bus bandwidth / cpu resources, so some cleverness in deciding
				2547	which loads to prefetch and which to not might be helpful. One
				2548	can imagine not prefetching client-stack-relative
				2549	(<computeroutput>%EBP</computeroutput> or
				2550	<computeroutput>%ESP</computeroutput>) accesses, since the stack
				2551	in general tends to show good locality anyway.</para>
				2552
				2553	<para>There's quite a lot of experimentation to do here, but I
				2554	think it might make an interesting week's work for
				2555	someone.</para>
				2556
				2557	<para>As of 15-ish March 2002, I've started to experiment with
				2558	this, using the AMD
				2559	<computeroutput>prefetch/prefetchw</computeroutput> insns.</para>
				2560
				2561	</sect2>
				2562
				2563
				2564	<sect2 id="mc-tech-docs.pranges" xreflabel="User-defined Permission Ranges">
				2565	<title>User-defined Permission Ranges</title>
				2566
				2567	<para>This is quite a large project -- perhaps a month's hacking
				2568	for a capable hacker to do a good job -- but it's potentially
				2569	very interesting. The outcome would be that Valgrind could
				2570	detect a whole class of bugs which it currently cannot.</para>
				2571
				2572	<para>The presentation falls into two pieces.</para>
				2573
				2574	<sect3 id="mc-tech-docs.psetting"
				2575	xreflabel="Part 1: User-defined Address-range Permission Setting">
				2576	<title>Part 1: User-defined Address-range Permission Setting</title>
				2577
				2578	<para>Valgrind intercepts the client's
				2579	<computeroutput>malloc</computeroutput>,
				2580	<computeroutput>free</computeroutput>, etc calls, watches system
				2581	calls, and watches the stack pointer move. This is currently the
				2582	only way it knows about which addresses are valid and which not.
				2583	Sometimes the client program knows extra information about its
				2584	memory areas. For example, the client could at some point know
				2585	that all elements of an array are out-of-date. We would like to
				2586	be able to convey to Valgrind this information that the array is
				2587	now addressable-but-uninitialised, so that Valgrind can then warn
				2588	if elements are used before they get new values.</para>
				2589
				2590	<para>What I would like are some macros like this:</para>
				2591	<programlisting><![CDATA[
				2592	VALGRIND_MAKE_NOACCESS(addr, len)
				2593	VALGRIND_MAKE_WRITABLE(addr, len)
				2594	VALGRIND_MAKE_READABLE(addr, len)]]></programlisting>
				2595
				2596	<para>and also, to check that memory is
addressable/initialised,</para>
<programlisting><![CDATA[
VALGRIND_CHECK_ADDRESSABLE(addr, len)
VALGRIND_CHECK_INITIALISED(addr, len)]]></programlisting>
				2601
				2602	<para>I then include in my sources a header defining these
				2603	macros, rebuild my app, run under Valgrind, and get user-defined
				2604	checks.</para>
				2605
				2606	<para>Now here's a neat trick. It's a nuisance to have to
				2607	re-link the app with some new library which implements the above
				2608	macros. So the idea is to define the macros so that the
				2609	resulting executable is still completely stand-alone, and can be
				2610	run without Valgrind, in which case the macros do nothing, but
				2611	when run on Valgrind, the Right Thing happens. How to do this?
				2612	The idea is for these macros to turn into a piece of inline
				2613	assembly code, which (1) has no effect when run on the real CPU,
				2614	(2) is easily spotted by Valgrind's JITter, and (3) no sane
				2615	person would ever write, which is important for avoiding false
				2616	matches in (2). So here's a suggestion:</para>
				2617	<programlisting><![CDATA[
				2618	VALGRIND_MAKE_NOACCESS(addr, len)]]></programlisting>
				2619
				2620	<para>becomes (roughly speaking)</para>
				2621	<programlisting><![CDATA[
				2622	movl addr, %eax
				2623	movl len, %ebx
				2624	movl $1, %ecx # 1 describes the action; MAKE_WRITABLE might be
				2625	# 2, etc.
				2626	rorl $13, %ecx
				2627	rorl $19, %ecx
				2628	rorl $11, %eax
				2629	rorl $21, %eax]]></programlisting>
				2630
				2631	<para>The rotate sequences have no effect (each pair rotates its
				2632	register by 32 bits in total), and it's unlikely they would appear
				2633	for any other reason, but they define a unique
				2634	operand constraints section at the end of a gcc inline-assembly
				2635	statement, we can tell gcc that the assembly fragment kills
				2636	<computeroutput>%eax</computeroutput>,
				2637	<computeroutput>%ebx</computeroutput>,
				2638	<computeroutput>%ecx</computeroutput> and the condition codes, so
				2639	this fragment is harmless when not running on Valgrind, runs
				2640	quickly in that case, and does not require any other
				2641	library support.</para>
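<para>The claim that the rotate pairs leave their registers unchanged
can be checked in portable C. This is only a sketch of the arithmetic,
not the macro itself: <computeroutput>rotr32</computeroutput>,
<computeroutput>marker_ecx</computeroutput> and
<computeroutput>marker_eax</computeroutput> are hypothetical helper
names, and <computeroutput>rotr32</computeroutput> models what
<computeroutput>rorl</computeroutput> does to the register value
(it does not model the condition codes, which the real instructions
do alter).</para>

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value right by n bits, for 0 < n < 32. */
uint32_t rotr32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x >> n) | (x << (32u - n));
}

/* The pair applied to %ecx in the listing: 13 + 19 = 32 bits in total. */
uint32_t marker_ecx(uint32_t x)
{
    return rotr32(rotr32(x, 13), 19);
}

/* The pair applied to %eax in the listing: 11 + 21 = 32 bits in total. */
uint32_t marker_eax(uint32_t x)
{
    return rotr32(rotr32(x, 11), 21);
}
```

<para>A single rotate does change the value; only the complete pair
restores it, which is why the sequence is safe on a real CPU yet
distinctive enough to serve as a marker.</para>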
				2642
				2643
				2644	</sect3>
				2645
				2646
				2647	<sect3 id="mc-tech-docs.prange-detect"
				2648	xreflabel="Part 2: Using it to detect Interference between Stack
				2649	Variables">
				2650	<title>Part 2: Using it to detect Interference between Stack
				2651	Variables</title>
				2652
				2653	<para>Currently Valgrind cannot detect errors of the following
				2654	form:</para>
				2655	<programlisting><![CDATA[
				2656	void fooble ( void )
				2657	{
				2658	int a[10];
				2659	int b[10];
				2660	a[10] = 99;
				2661	}]]></programlisting>
				2662
				2663	<para>Now imagine rewriting this as</para>
				2664	<programlisting><![CDATA[
				2665	void fooble ( void )
				2666	{
				2667	int spacer0;
				2668	int a[10];
				2669	int spacer1;
				2670	int b[10];
				2671	int spacer2;
				2672	VALGRIND_MAKE_NOACCESS(&spacer0, sizeof(int));
				2673	VALGRIND_MAKE_NOACCESS(&spacer1, sizeof(int));
				2674	VALGRIND_MAKE_NOACCESS(&spacer2, sizeof(int));
				2675	a[10] = 99;
				2676	}]]></programlisting>
				2677
				2678	<para>Assuming the compiler keeps the locals adjacent and in
				2679	declaration order (or its reverse), the invalid write now hits
				2680	<computeroutput>spacer0</computeroutput> or
				2681	<computeroutput>spacer1</computeroutput>, so Valgrind will spot the error.</para>
				2682
				2683	<para>There are two complications.</para>
				2684
				2685	<orderedlist>
				2686
				2687	<listitem>
				2688	<para>The first is that we don't want to annotate sources by
				2689	hand, so the Right Thing to do is to write a C/C++ parser,
				2690	annotator, prettyprinter which does this automatically, and
de	97ab7e7	2005-11-27 18:19:40 +0000	[diff] [blame]	2691	run it on post-CPP'd C/C++ source. The parser/prettyprinter
				2692	is probably not as hard as it sounds; I would write it in Haskell,
				2693	a powerful functional language well suited to doing symbolic
sewardj	3387889	2007-11-17 09:43:25 +0000	[diff] [blame^]	2694	computation, with which I am intimately familiar. There is
njn	3e986b2	2004-11-30 10:43:45 +0000	[diff] [blame]	2695	already a C parser written in Haskell by someone in the
				2696	Haskell community, and that would probably be a good starting
				2697	point.</para>
				2698	</listitem>
				2699
				2700
				2701	<listitem>
				2702	<para>The second complication is how to get rid of these
				2703	<computeroutput>NOACCESS</computeroutput> records inside
				2704	Valgrind when the instrumented function exits; after all,
				2705	these refer to stack addresses and will make no sense
				2706	whatever when some other function happens to re-use the same
				2707	stack address range, probably shortly afterwards. I think I
				2708	would be inclined to define a special stack-specific
				2709	macro:</para>
				2710	<programlisting><![CDATA[
				2711	VALGRIND_MAKE_NOACCESS_STACK(addr, len)]]></programlisting>
				2712	<para>which causes Valgrind to record the client's
				2713	<computeroutput>%ESP</computeroutput> at the time it is
				2714	executed. Valgrind will then watch for changes in
				2715	<computeroutput>%ESP</computeroutput> and discard such
				2716	records as soon as the protected area is uncovered by an
				2717	increase in <computeroutput>%ESP</computeroutput>. I
				2718	hesitate with this scheme only because it is potentially
				2719	expensive, if there are hundreds of such records, and
				2720	considering that changes in
				2721	<computeroutput>%ESP</computeroutput> already require
				2722	expensive messing with stack access permissions.</para>
				2723	</listitem>
				2724	</orderedlist>
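<para>The record-discarding scheme in the second item might look
roughly like the following. This is a sketch under stated assumptions,
not an existing Valgrind mechanism: the record table, the names
<computeroutput>add_stack_record</computeroutput> and
<computeroutput>on_sp_change</computeroutput>, and the rule "discard a
record once its start address lies below the new stack pointer" are
all hypothetical, and the stack is assumed to grow downwards as on
x86.</para>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RECORDS 16

/* One protected stack range, plus the stack pointer seen when it was made. */
struct noaccess_record {
    uintptr_t addr;
    size_t    len;
    uintptr_t sp_at_creation;
};

static struct noaccess_record records[MAX_RECORDS];
static size_t n_records = 0;

/* Record a protected range. Returns 0 on success, -1 if the table is
   full or the range starts below the current stack pointer. */
int add_stack_record(uintptr_t addr, size_t len, uintptr_t sp)
{
    if (n_records == MAX_RECORDS || addr < sp)
        return -1;
    records[n_records].addr = addr;
    records[n_records].len = len;
    records[n_records].sp_at_creation = sp;
    n_records++;
    return 0;
}

/* Called on every stack pointer change. The live stack is everything at
   or above new_sp, so any record starting below new_sp has been
   uncovered and is dropped. Returns the number of records dropped. */
size_t on_sp_change(uintptr_t new_sp)
{
    size_t kept = 0;
    for (size_t i = 0; i < n_records; i++)
        if (records[i].addr >= new_sp)
            records[kept++] = records[i];
    size_t dropped = n_records - kept;
    n_records = kept;
    return dropped;
}

/* Number of records currently held. */
size_t live_records(void)
{
    return n_records;
}
```

<para>The linear scan on every stack pointer change is what makes the
scheme potentially expensive when there are hundreds of records.</para>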
				2725
				2726	<para>This is probably easier and more robust than having the
				2727	instrumenter program try to spot all exit points for the
				2728	procedure and place suitable deallocation annotations there.
				2729	Plus C++ procedures can bomb out at any point if they get an
				2730	exception, so spotting return points at the source level just
				2731	won't work at all.</para>
				2732
				2733	<para>Although it takes some work, it's all eminently doable, and it
				2734	would make Valgrind into an even more useful tool.</para>
				2735
				2736	</sect3>
				2737
				2738	</sect2>
				2739
				2740	</sect1>
				2741	</chapter>