Blame - helgrind/docs/hg-manual.xml - platform/external/valgrind

sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1	<?xml version="1.0"?> <!-- -- sgml -- -->
				2	<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
sewardj	3387889	2007-11-17 09:43:25 +0000	[diff] [blame^]	3	"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"
				4	[ <!ENTITY % vg-entities SYSTEM "../../docs/xml/vg-entities.xml"> %vg-entities; ]>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	5
				6
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	7	<chapter id="hg-manual" xreflabel="Helgrind: thread error detector">
				8	<title>Helgrind: a thread error detector</title>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	9
				10	<para>To use this tool, you must specify
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	11	<computeroutput>--tool=helgrind</computeroutput> on the Valgrind
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	12	command line.</para>
				13
				14
				15
				16
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	17	<sect1 id="hg-manual.overview" xreflabel="Overview">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	18	<title>Overview</title>
				19
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	20	<para>Helgrind is a Valgrind tool for detecting synchronisation errors
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	21	in C, C++ and Fortran programs that use the POSIX pthreads
				22	threading primitives.</para>
				23
				24	<para>The main abstractions in POSIX pthreads are: a set of threads
sharing a common address space, thread creation, thread joining,
				26	thread exit, mutexes (locks), condition variables (inter-thread event
				27	notifications), reader-writer locks, and semaphores.</para>
				28
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	29	<para>Helgrind is aware of all these abstractions and tracks their
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	30	effects as accurately as it can. Currently it does not correctly
				31	handle pthread barriers and pthread spinlocks, although it will not
				32	object if you use them. On x86 and amd64 platforms, it understands
				33	and partially handles implicit locking arising from the use of the
				34	LOCK instruction prefix.
				35	</para>
				36
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	37	<para>Helgrind can detect three classes of errors, which are discussed
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	38	in detail in the next three sections:</para>
				39
				40	<orderedlist>
				41	<listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	42	<para><link linkend="hg-manual.api-checks">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	43	Misuses of the POSIX pthreads API.</link></para>
				44	</listitem>
				45	<listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	46	<para><link linkend="hg-manual.lock-orders">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	47	Potential deadlocks arising from lock
				48	ordering problems.</link></para>
				49	</listitem>
				50	<listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	51	<para><link linkend="hg-manual.data-races">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	52	Data races -- accessing memory without adequate locking.
				53	</link></para>
				54	</listitem>
				55	</orderedlist>
				56
				57	<para>Following those is a section containing
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	58	<link linkend="hg-manual.effective-use">
				59	hints and tips on how to get the best out of Helgrind.</link>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	60	</para>
				61
				62	<para>Then there is a
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	63	<link linkend="hg-manual.options">summary of command-line
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	64	options.</link>
				65	</para>
				66
				67	<para>Finally, there is
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	68	<link linkend="hg-manual.todolist">a brief summary of areas in which Helgrind
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	69	could be improved.</link>
				70	</para>
				71
				72	</sect1>
				73
				74
				75
				76
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	77	<sect1 id="hg-manual.api-checks" xreflabel="API Checks">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	78	<title>Detected errors: Misuses of the POSIX pthreads API</title>
				79
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	80	<para>Helgrind intercepts calls to many POSIX pthreads functions, and
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	81	is therefore able to report on various common problems. Although
these are unglamorous errors, their presence can lead to undefined
				83	program behaviour and hard-to-find bugs later in execution. The
				84	detected errors are:</para>
				85
				86	<itemizedlist>
				87	<listitem><para>unlocking an invalid mutex</para></listitem>
				88	<listitem><para>unlocking a not-locked mutex</para></listitem>
				89	<listitem><para>unlocking a mutex held by a different
				90	thread</para></listitem>
				91	<listitem><para>destroying an invalid or a locked mutex</para></listitem>
				92	<listitem><para>recursively locking a non-recursive mutex</para></listitem>
				93	<listitem><para>deallocation of memory that contains a
				94	locked mutex</para></listitem>
				95	<listitem><para>passing mutex arguments to functions expecting
				96	reader-writer lock arguments, and vice
				97	versa</para></listitem>
				98	<listitem><para>when a POSIX pthread function fails with an
				99	error code that must be handled</para></listitem>
				100	<listitem><para>when a thread exits whilst still holding locked
				101	locks</para></listitem>
				102	<listitem><para>calling <computeroutput>pthread_cond_wait</computeroutput>
				103	with a not-locked mutex, or one locked by a different
				104	thread</para></listitem>
				105	</itemizedlist>
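
<para>As an illustration of one item in this list, the following
sketch unlocks a mutex that was never locked. It is a hypothetical
test program, not part of Valgrind: the function name
<computeroutput>unlock_without_lock</computeroutput> and the choice of
an error-checking mutex
(<computeroutput>PTHREAD_MUTEX_ERRORCHECK</computeroutput>, so that the
C library returns an error code instead of leaving the behaviour
undefined) are assumptions made for the example.</para>

<programlisting><![CDATA[
```c
#define _POSIX_C_SOURCE 200809L  /* expose PTHREAD_MUTEX_ERRORCHECK */
#include <errno.h>
#include <pthread.h>

/* Unlocks a mutex that is not locked and returns the error code that
   pthread_mutex_unlock reports.  An error-checking mutex is used so
   the C library detects the misuse itself. */
int unlock_without_lock(void) {
    pthread_mutexattr_t attr;
    pthread_mutex_t     mx;
    int                 rc;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&mx, &attr);
    pthread_mutexattr_destroy(&attr);

    rc = pthread_mutex_unlock(&mx);  /* misuse: mx was never locked */

    pthread_mutex_destroy(&mx);
    return rc;                       /* EPERM for this mutex type */
}
```
]]></programlisting>

<para>A program that calls this function from
<computeroutput>main</computeroutput> and is run with
<computeroutput>--tool=helgrind</computeroutput> would be expected to
draw a report of the "unlocked a not-locked lock" kind.</para>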
				106
				107	<para>Checks pertaining to the validity of mutexes are generally also
				108	performed for reader-writer locks.</para>
				109
				110	<para>Various kinds of this-can't-possibly-happen events are also
				111	reported. These usually indicate bugs in the system threading
				112	library.</para>
				113
				114	<para>Reported errors always contain a primary stack trace indicating
				115	where the error was detected. They may also contain auxiliary stack
				116	traces giving additional information. In particular, most errors
				117	relating to mutexes will also tell you where that mutex first came to
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	118	Helgrind's attention (the "<computeroutput>was first observed
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	119	at</computeroutput>" part), so you have a chance of figuring out which
				120	mutex it is referring to. For example:</para>
				121
				122	<programlisting><![CDATA[
				123	Thread #1 unlocked a not-locked lock at 0x7FEFFFA90
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	124	at 0x4C2408D: pthread_mutex_unlock (hg_intercepts.c:492)
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	125	by 0x40073A: nearly_main (tc09_bad_unlock.c:27)
				126	by 0x40079B: main (tc09_bad_unlock.c:50)
				127	Lock at 0x7FEFFFA90 was first observed
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	128	at 0x4C25D01: pthread_mutex_init (hg_intercepts.c:326)
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	129	by 0x40071F: nearly_main (tc09_bad_unlock.c:23)
				130	by 0x40079B: main (tc09_bad_unlock.c:50)
				131	]]></programlisting>
				132
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	133	<para>Helgrind has a way of summarising thread identities, as
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	134	evidenced here by the text "<computeroutput>Thread
				135	#1</computeroutput>". This is so that it can speak about threads and
				136	sets of threads without overwhelming you with details. See
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	137	<link linkend="hg-manual.data-races.errmsgs">below</link>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	138	for more information on interpreting error messages.</para>
				139
				140	</sect1>
				141
				142
				143
				144
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	145	<sect1 id="hg-manual.lock-orders" xreflabel="Lock Orders">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	146	<title>Detected errors: Inconsistent Lock Orderings</title>
				147
				148	<para>In this section, and in general, to "acquire" a lock simply
				149	means to lock that lock, and to "release" a lock means to unlock
				150	it.</para>
				151
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	152	<para>Helgrind monitors the order in which threads acquire locks.
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	153	This allows it to detect potential deadlocks which could arise from
				154	the formation of cycles of locks. Detecting such inconsistencies is
				155	useful because, whilst actual deadlocks are fairly obvious, potential
				156	deadlocks may never be discovered during testing and could later lead
				157	to hard-to-diagnose in-service failures.</para>
				158
				159	<para>The simplest example of such a problem is as
				160	follows.</para>
				161
				162	<itemizedlist>
				163	<listitem><para>Imagine some shared resource R, which, for whatever
				164	reason, is guarded by two locks, L1 and L2, which must both be held
				165	when R is accessed.</para>
				166	</listitem>
				167	<listitem><para>Suppose a thread acquires L1, then L2, and proceeds
				168	to access R. The implication of this is that all threads in the
				169	program must acquire the two locks in the order first L1 then L2.
				170	Not doing so risks deadlock.</para>
				171	</listitem>
				172	<listitem><para>The deadlock could happen if two threads -- call them
				173	T1 and T2 -- both want to access R. Suppose T1 acquires L1 first,
				174	and T2 acquires L2 first. Then T1 tries to acquire L2, and T2 tries
				175	to acquire L1, but those locks are both already held. So T1 and T2
				176	become deadlocked.</para>
				177	</listitem>
				178	</itemizedlist>
				179
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	180	<para>Helgrind builds a directed graph indicating the order in which
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	181	locks have been acquired in the past. When a thread acquires a new
				182	lock, the graph is updated, and then checked to see if it now contains
				183	a cycle. The presence of a cycle indicates a potential deadlock involving
				184	the locks in the cycle.</para>
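
<para>The graph check described above can be sketched as follows.
This is a simplified model written for this manual, not Helgrind's
implementation: the names <computeroutput>graph_reset</computeroutput>
and <computeroutput>graph_acquire</computeroutput>, the fixed limit of
eight locks, and the use of small integers and bitmasks to identify
locks are all assumptions.</para>

<programlisting><![CDATA[
```c
#include <string.h>

#define MAX_LOCKS 8

/* edge[a][b] != 0 records that lock b was once acquired while lock a
   was already held, i.e. the observed order "a before b". */
static unsigned char edge[MAX_LOCKS][MAX_LOCKS];

/* Depth-first search: is there a path from 'from' to 'to'? */
static int reachable(int from, int to, unsigned char *seen) {
    if (from == to) return 1;
    seen[from] = 1;
    for (int next = 0; next < MAX_LOCKS; next++) {
        if (edge[from][next] && !seen[next] && reachable(next, to, seen))
            return 1;
    }
    return 0;
}

/* Forget all previously observed lock orders. */
void graph_reset(void) { memset(edge, 0, sizeof edge); }

/* Record that 'lock' is acquired while the locks in the bitmask 'held'
   are already held.  Returns 1 if a new edge closes a cycle, that is,
   if this acquisition contradicts an order observed earlier. */
int graph_acquire(unsigned held, int lock) {
    int violated = 0;
    for (int h = 0; h < MAX_LOCKS; h++) {
        if (!(held & (1u << h)) || h == lock) continue;
        unsigned char seen[MAX_LOCKS] = {0};
        if (reachable(lock, h, seen))  /* existing path lock -> ... -> h */
            violated = 1;
        edge[h][lock] = 1;             /* new edge h -> lock */
    }
    return violated;
}
```
]]></programlisting>

<para>In this model, acquiring lock 2 while holding lock 1 records the
order "1 before 2". A later acquisition of lock 1 while holding lock 2
then closes a cycle, and <computeroutput>graph_acquire</computeroutput>
returns 1.</para>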
				185
				186	<para>In simple situations, where the cycle only contains two locks,
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	187	Helgrind will show where the required order was established:</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	188
				189	<programlisting><![CDATA[
				190	Thread #1: lock order "0x7FEFFFAB0 before 0x7FEFFFA80" violated
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	191	at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388)
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	192	by 0x40081F: main (tc13_laog1.c:24)
				193	Required order was established by acquisition of lock at 0x7FEFFFAB0
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	194	at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388)
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	195	by 0x400748: main (tc13_laog1.c:17)
				196	followed by a later acquisition of lock at 0x7FEFFFA80
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	197	at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388)
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	198	by 0x400773: main (tc13_laog1.c:18)
				199	]]></programlisting>
				200
				201	<para>When there are more than two locks in the cycle, the error is
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	202	equally serious. However, at present Helgrind does not show the locks
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	203	involved, so as to avoid flooding you with information. That could be
fixed in future. Here is an example involving a cycle
of five locks from a naive implementation of the famous Dining
				206	Philosophers problem
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	207	(see <computeroutput>helgrind/tests/tc14_laog_dinphils.c</computeroutput>).
				208	In this case Helgrind has detected that all 5 philosophers could
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	209	simultaneously pick up their left fork and then deadlock whilst
				210	waiting to pick up their right forks.</para>
				211
				212	<programlisting><![CDATA[
				213	Thread #6: lock order "0x6010C0 before 0x601160" violated
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	214	at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388)
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	215	by 0x4007C0: dine (tc14_laog_dinphils.c:19)
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	216	by 0x4C25DF7: mythread_wrapper (hg_intercepts.c:178)
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	217	by 0x4E2F09D: start_thread (in /lib64/libpthread-2.5.so)
				218	by 0x51054CC: clone (in /lib64/libc-2.5.so)
				219	]]></programlisting>
				220
				221	</sect1>
				222
				223
				224
				225
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	226	<sect1 id="hg-manual.data-races" xreflabel="Data Races">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	227	<title>Detected errors: Data Races</title>
				228
				229	<para>A data race happens, or could happen, when two threads
				230	access a shared memory location without using suitable locks to
				231	ensure single-threaded access. Such missing locking can cause
obscure timing-dependent bugs. Ensuring programs are race-free is
				233	one of the central difficulties of threaded programming.</para>
				234
				235	<para>Reliably detecting races is a difficult problem, and most
of Helgrind's internals are devoted to dealing with it.
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	237	As a consequence this section is somewhat long and involved.
				238	We begin with a simple example.</para>
				239
				240
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	241	<sect2 id="hg-manual.data-races.example" xreflabel="Simple Race">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	242	<title>A Simple Data Race</title>
				243
				244	<para>About the simplest possible example of a race is as follows. In
				245	this program, it is impossible to know what the value
				246	of <computeroutput>var</computeroutput> is at the end of the program.
Is it 2? Or 1?</para>
				248
				249	<programlisting><![CDATA[
#include <pthread.h>

int var = 0;

void* child_fn ( void* arg ) {
   var++; /* Unprotected relative to parent */ /* this is line 6 */
   return NULL;
}

int main ( void ) {
   pthread_t child;
   pthread_create(&child, NULL, child_fn, NULL);
   var++; /* Unprotected relative to child */ /* this is line 13 */
   pthread_join(child, NULL);
   return 0;
}
				266	]]></programlisting>
				267
				268	<para>The problem is there is nothing to
				269	stop <computeroutput>var</computeroutput> being updated simultaneously
				270	by both threads. A correct program would
				271	protect <computeroutput>var</computeroutput> with a lock of type
				272	<computeroutput>pthread_mutex_t</computeroutput>, which is acquired
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	273	before each access and released afterwards. Helgrind's output for
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	274	this program is:</para>
				275
				276	<programlisting><![CDATA[
				277	Thread #1 is the program's root thread
				278
				279	Thread #2 was created
				280	at 0x510548E: clone (in /lib64/libc-2.5.so)
				281	by 0x4E2F305: do_clone (in /lib64/libpthread-2.5.so)
				282	by 0x4E2F7C5: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.5.so)
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	283	by 0x4C23870: pthread_create@* (hg_intercepts.c:198)
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	284	by 0x4005F1: main (simple_race.c:12)
				285
				286	Possible data race during write of size 4 at 0x601034
				287	at 0x4005F2: main (simple_race.c:13)
				288	Old state: shared-readonly by threads #1, #2
				289	New state: shared-modified by threads #1, #2
				290	Reason: this thread, #1, holds no consistent locks
				291	Location 0x601034 has never been protected by any lock
				292	]]></programlisting>
				293
				294	<para>This is quite a lot of detail for an apparently simple error.
				295	The last clause is the main error message. It says there is a race as
				296	a result of a write of size 4 (bytes), at 0x601034, which is
				297	presumably the address of <computeroutput>var</computeroutput>,
				298	happening in function <computeroutput>main</computeroutput> at line 13
				299	in the program.</para>
				300
				301	<para>Note that it is purely by chance that the race is
				302	reported for the parent thread's access. It could equally have been
				303	reported instead for the child's access, at line 6. The error will
				304	only be reported for one of the locations, since neither the parent
				305	nor child is, by itself, incorrect. It is only when both access
				306	<computeroutput>var</computeroutput> without a lock that an error
				307	exists.</para>
				308
				309	<para>The error message shows some other interesting details. The
				310	sections below explain them. Here we merely note their presence:</para>
				311
				312	<itemizedlist>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	313	<listitem><para>Helgrind maintains some kind of state machine for the
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	314	memory location in question, hence the "<computeroutput>Old
				315	state:</computeroutput>" and "<computeroutput>New
				316	state:</computeroutput>" lines.</para>
				317	</listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	318	<listitem><para>Helgrind keeps track of which threads have accessed
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	319	the location: "<computeroutput>threads #1, #2</computeroutput>".
				320	Before printing the main error message, it prints the creation
				321	points of these two threads, so you can see which threads it is
				322	referring to.</para>
				323	</listitem>
sewardj	3387889	2007-11-17 09:43:25 +0000	[diff] [blame^]	324	<listitem><para>Helgrind tries to provide an explanation of why the
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	325	race exists: "<computeroutput>Location 0x601034 has never been
				326	protected by any lock</computeroutput>".</para>
				327	</listitem>
				328	</itemizedlist>
				329
				330	<para>Understanding the memory state machine is central to
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	331	understanding Helgrind's race-detection algorithm. The next three
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	332	subsections explain this.</para>
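
<para>Returning to the example at the start of this section: a correct
program would protect <computeroutput>var</computeroutput> with a
mutex. The following is a minimal sketch of such a variant. The names
<computeroutput>var_lock</computeroutput>,
<computeroutput>locked_child_fn</computeroutput> and
<computeroutput>run_locked_increment</computeroutput> are invented for
the example.</para>

<programlisting><![CDATA[
```c
#include <pthread.h>

static int             var = 0;
static pthread_mutex_t var_lock = PTHREAD_MUTEX_INITIALIZER;

static void* locked_child_fn(void* arg) {
    (void)arg;
    pthread_mutex_lock(&var_lock);
    var++;                          /* protected by var_lock */
    pthread_mutex_unlock(&var_lock);
    return NULL;
}

/* Runs parent and child increments under the same lock and returns
   the final value of var, or -1 if the thread could not be created. */
int run_locked_increment(void) {
    pthread_t child;

    var = 0;                        /* no other thread exists yet */
    if (pthread_create(&child, NULL, locked_child_fn, NULL) != 0)
        return -1;

    pthread_mutex_lock(&var_lock);
    var++;                          /* protected by var_lock */
    pthread_mutex_unlock(&var_lock);

    pthread_join(child, NULL);
    return var;                     /* child has exited: no lock needed */
}
```
]]></programlisting>

<para>Because both increments happen with
<computeroutput>var_lock</computeroutput> held, the final value is
always 2, whichever thread runs first.</para>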
				333
				334	</sect2>
				335
				336
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	337	<sect2 id="hg-manual.data-races.memstates" xreflabel="Memory States">
				338	<title>Helgrind's Memory State Machine</title>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	339
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	340	<para>Helgrind tracks the state of every byte of memory used by your
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	341	program. There are a number of states, but only three are
				342	interesting:</para>
				343
				344	<itemizedlist>
				345	<listitem><para>Exclusive: memory in this state is regarded as owned
				346	exclusively by one particular thread. That thread may read and
				347	write it without a lock. Even in highly threaded programs, the
				348	majority of locations never leave the Exclusive state, since most
				349	data is thread-private.</para>
				350	</listitem>
				351	<listitem><para>Shared-Readonly: memory in this state is regarded as
				352	shared by multiple threads. In this state, any thread may read the
				353	memory without a lock, reflecting the fact that readonly data may
				354	safely be shared between threads without locking.</para>
				355	</listitem>
				356	<listitem><para>Shared-Modified: memory in this state is regarded as
				357	shared by multiple threads, at least one of which has written to it.
				358	All participating threads must hold at least one lock in common when
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	359	accessing the memory. If no such lock exists, Helgrind reports a
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	360	race error.</para>
				361	</listitem>
				362	</itemizedlist>
				363
				364	<para>Let's review the simple example above with this in mind. When
				365	the program starts, <computeroutput>var</computeroutput> is not in any
				366	of these states. Either the parent or child thread gets to its
<computeroutput>var++</computeroutput> first, and
thereby gets Exclusive ownership of the location.</para>
				369
				370	<para>The later-running thread now arrives at
				371	its <computeroutput>var++</computeroutput> statement. It first reads
				372	the existing value from memory.
				373	Because <computeroutput>var</computeroutput> is currently marked as
				374	owned exclusively by the other thread, its state is changed to
				375	shared-readonly by both threads.</para>
				376
				377	<para>This same thread adds one to the value it has and stores it back
				378	in <computeroutput>var</computeroutput>. This causes another state
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	379	change, this time to the shared-modified state. Because Helgrind has
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	380	also been tracking which threads hold which locks, it can see that
				381	<computeroutput>var</computeroutput> is in shared-modified state but
				382	no lock has been used to consistently protect it. Hence a race is
				383	reported exactly at the transition from shared-readonly to
				384	shared-modified.</para>
				385
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	386	<para>The essence of the algorithm is this. Helgrind keeps track of
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	387	each memory location that has been accessed by more than one thread.
				388	For each such location it incrementally infers the set of locks which
				389	have consistently been used to protect that location. If the
				390	location's lockset becomes empty, and at some point one of the threads
				391	attempts to write to it, a race is then reported.</para>
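
<para>The state machine and lockset intersection described so far can
be modelled in a few lines of C. This is a deliberately simplified
sketch, not Helgrind's implementation: it ignores the refinements
described in the following sections, and the type and function names
(<computeroutput>Location</computeroutput>,
<computeroutput>location_init</computeroutput>,
<computeroutput>location_access</computeroutput>) as well as the
representation of locksets as bitmasks are assumptions.</para>

<programlisting><![CDATA[
```c
typedef enum {
    MEM_NEW, MEM_EXCLUSIVE, MEM_SHARED_RO, MEM_SHARED_MOD
} MemState;

typedef struct {
    MemState state;
    int      owner;    /* owning thread while in MEM_EXCLUSIVE */
    unsigned lockset;  /* candidate protecting locks, one bit per lock */
} Location;

void location_init(Location *loc) {
    loc->state   = MEM_NEW;
    loc->owner   = -1;
    loc->lockset = ~0u;  /* initially every lock is a candidate */
}

/* Record an access by 'thread' holding the locks in bitmask 'held'.
   Returns 1 if this simplified model would report a race, else 0. */
int location_access(Location *loc, int thread, unsigned held, int is_write) {
    switch (loc->state) {
    case MEM_NEW:
        loc->state = MEM_EXCLUSIVE;
        loc->owner = thread;
        return 0;
    case MEM_EXCLUSIVE:
        if (thread == loc->owner) return 0;   /* owner needs no lock */
        loc->state = is_write ? MEM_SHARED_MOD : MEM_SHARED_RO;
        break;
    case MEM_SHARED_RO:
        if (is_write) loc->state = MEM_SHARED_MOD;
        break;
    case MEM_SHARED_MOD:
        break;
    }
    loc->lockset &= held;  /* keep only locks held at every shared access */
    return loc->state == MEM_SHARED_MOD && loc->lockset == 0;
}
```
]]></programlisting>

<para>Replaying the simple race in this model (thread 1 reads and
writes, then thread 2 reads and writes, all with
<computeroutput>held</computeroutput> equal to zero) yields a report
only on the final write, at the transition from shared-readonly to
shared-modified.</para>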
				392
				393	<para>This technique is known as "lockset inference" and was
				394	introduced in: "Eraser: A Dynamic Data Race Detector for Multithreaded
				395	Programs" (Stefan Savage, Michael Burrows, Greg Nelson, Patrick
				396	Sobalvarro and Thomas Anderson, ACM Transactions on Computer Systems,
				397	15(4):391-411, November 1997).</para>
				398
				399	<para>Lockset inference has since been widely implemented, studied and
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	400	extended. Helgrind incorporates several refinements aimed at avoiding
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	401	the high false error rate that naive versions of the algorithm suffer
				402	from. A
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	403	<link linkend="hg-manual.data-races.summary">summary of the complete
				404	algorithm used by Helgrind</link> is presented below. First, however,
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	405	it is important to understand details of transitions pertaining to the
				406	Exclusive-ownership state.</para>
				407
				408	</sect2>
				409
				410
				411
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	412	<sect2 id="hg-manual.data-races.exclusive" xreflabel="Excl Transfers">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	413	<title>Transfers of Exclusive Ownership Between Threads</title>
				414
				415	<para>As presented, the algorithm is far too strict. It reports many
				416	errors in perfectly correct, widely used parallel programming
				417	constructions, for example, using child worker threads and worker
				418	thread pools.</para>
				419
				420	<para>To avoid these false errors, we must refine the algorithm so
				421	that it keeps memory in an Exclusive ownership state in cases where it
				422	would otherwise decay into a shared-readonly or shared-modified state.
				423	Recall that Exclusive ownership is special in that it grants the
				424	owning thread the right to access memory without use of any locks. In
				425	order to support worker-thread and worker-thread-pool idioms, we will
				426	allow threads to steal exclusive ownership of memory from other
				427	threads under certain circumstances.</para>
				428
				429	<para>Here's an example. Imagine a parent thread creates child
				430	threads to do units of work. For each unit of work, the parent
				431	allocates a work buffer, fills it in, and creates the child thread,
				432	handing it a pointer to the buffer. The child reads/writes the buffer
				433	and eventually exits, and the waiting parent then extracts the results
				434	from the buffer:</para>
				435
				436	<programlisting><![CDATA[
typedef ... Buffer;

pthread_t child;
Buffer    buf;

/* ---- Parent ---- */          /* ---- Child ---- */

/* parent writes workload into buf */
pthread_create( &child, child_fn, &buf );

/* parent does not read */      void child_fn ( Buffer* buf ) {
/* or write buf */                 /* read/write buf */
                                }

pthread_join ( child );
/* parent reads results from buf */
				453	]]></programlisting>
				454
				455	<para>Although <computeroutput>buf</computeroutput> is accessed by
				456	both threads, neither uses locks, yet the program is race-free. The
				457	essential observation is that the child's creation and exit create
				458	synchronisation events between it and the parent. These force the
				459	child's accesses to <computeroutput>buf</computeroutput> to happen
				460	after the parent initialises <computeroutput>buf</computeroutput>, and
				461	before the parent reads the results
				462	from <computeroutput>buf</computeroutput>.</para>
				463
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	464	<para>To model this, Helgrind allows the child to steal, from the
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	465	parent, exclusive ownership of any memory exclusively owned by the
				466	parent before the pthread_create call. Similarly, once the parent's
				467	pthread_join call returns, it can steal back ownership of memory
				468	exclusively owned by the child. In this way ownership
				469	of <computeroutput>buf</computeroutput> is transferred from parent to
				470	child and back, so the basic algorithm does not report any races
				471	despite the absence of any locking.</para>
				472
				473	<para>Note that the child may only steal memory owned by the parent
				474	prior to the pthread_create call. If the child attempts to read or
				475	write memory which is also accessed by the parent in between the
				476	pthread_create and pthread_join calls, an error is still
				477	reported.</para>
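
<para>For reference, the worker-buffer idiom can be written as a
complete program fragment using the real pthreads signatures. The
fields of <computeroutput>Buffer</computeroutput>, the doubling
"workload" and the names <computeroutput>worker_fn</computeroutput>
and <computeroutput>run_worker</computeroutput> are invented for this
sketch.</para>

<programlisting><![CDATA[
```c
#include <pthread.h>

typedef struct {
    int input;   /* written by the parent before pthread_create */
    int result;  /* written by the child, read by the parent after join */
} Buffer;

static void* worker_fn(void* arg) {
    Buffer* buf = arg;
    buf->result = buf->input * 2;   /* child reads and writes buf */
    return NULL;
}

/* Hands one unit of work to a child thread and returns its result,
   or -1 if the thread could not be created. */
int run_worker(int input) {
    pthread_t child;
    Buffer    buf;

    buf.input  = input;             /* parent writes workload into buf */
    buf.result = 0;
    if (pthread_create(&child, NULL, worker_fn, &buf) != 0)
        return -1;

    /* parent does not read or write buf while the child runs */

    pthread_join(child, NULL);
    return buf.result;              /* parent reads results from buf */
}
```
]]></programlisting>

<para>No lock is used, yet every access to
<computeroutput>buf</computeroutput> is ordered by the
<computeroutput>pthread_create</computeroutput> and
<computeroutput>pthread_join</computeroutput> calls.</para>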
				478
				479	<para>This technique was introduced with the name "thread lifetime
				480	segments" in "Runtime Checking of Multithreaded Applications with
				481	Visual Threads" (Jerry J. Harrow, Jr, Proceedings of the 7th
International SPIN Workshop on Model Checking of Software, Stanford,
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	483	California, USA, August 2000, LNCS 1885, pp331--342). Helgrind
				484	implements an extended version of it. Specifically, Helgrind allows
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	485	transfer of exclusive ownership in the following situations:</para>
				486
				487	<itemizedlist>
				488	<listitem><para>At thread creation: a child can acquire ownership of
				489	memory held exclusively by the parent prior to the child's
				490	creation.</para>
				491	</listitem>
				492	<listitem><para>At thread joining: the joiner (thread not exiting)
				493	can acquire ownership of memory held exclusively by the joinee
				494	(thread that is exiting) at the point it exited.</para>
				495	</listitem>
				496	<listitem><para>At condition variable signallings and broadcasts. A
				497	thread Tw which completes a pthread_cond_wait call as a result of
				498	a signal or broadcast on the same condition variable by some other
				499	thread Ts, may acquire ownership of memory held exclusively by
				500	Ts prior to the pthread_cond_signal/broadcast
				501	call.</para>
				502	</listitem>
<listitem><para>At semaphore post (sem_post) calls. A thread Tw
which completes a sem_wait call as a result of a sem_post call
				505	on the same semaphore by some other thread Tp, may acquire
				506	ownership of memory held exclusively by Tp prior to the sem_post
				507	call.</para>
				508	</listitem>
				509	</itemizedlist>
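
<para>The semaphore case in the list above corresponds to code of the
following shape. This is a sketch under stated assumptions: the names
<computeroutput>message</computeroutput>,
<computeroutput>ready</computeroutput>,
<computeroutput>producer_fn</computeroutput> and
<computeroutput>run_handoff</computeroutput> are invented, and an
unnamed POSIX semaphore created with
<computeroutput>sem_init</computeroutput> is assumed to be
available on the platform.</para>

<programlisting><![CDATA[
```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

static int   message;  /* handed from producer to consumer */
static sem_t ready;    /* posted once message has been written */

static void* producer_fn(void* arg) {
    (void)arg;
    message = 123;     /* written before the sem_post call */
    sem_post(&ready);
    return NULL;
}

/* Waits for the producer's post, then reads message without a lock.
   Returns the value read, or -1 if the thread could not be created. */
int run_handoff(void) {
    pthread_t producer;
    int       seen;

    message = 0;
    sem_init(&ready, 0, 0);
    if (pthread_create(&producer, NULL, producer_fn, NULL) != 0) {
        sem_destroy(&ready);
        return -1;
    }

    /* retry if the wait is interrupted by a signal */
    while (sem_wait(&ready) == -1 && errno == EINTR)
        continue;
    seen = message;    /* read only after sem_wait has completed */

    pthread_join(producer, NULL);
    sem_destroy(&ready);
    return seen;
}
```
]]></programlisting>

<para>Here the waiting thread plays the role of Tw and the producer
the role of Tp: the write to <computeroutput>message</computeroutput>
precedes the <computeroutput>sem_post</computeroutput> call, and the
read follows the completed
<computeroutput>sem_wait</computeroutput> call.</para>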
				510
				511	</sect2>
				512
				513
				514
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	515	<sect2 id="hg-manual.data-races.re-excl" xreflabel="Re-Excl Transfers">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	516	<title>Restoration of Exclusive Ownership</title>
				517
				518	<para>Another common idiom is to partition the lifetime of the program
				519	as a whole into several distinct phases. In some of those phases, a
				520	memory location may be accessed by multiple threads and so require
				521	locking. In other phases only one thread exists and so can access the
				522	memory without locking. For example:</para>
				523
				524	<programlisting><![CDATA[
int             var = 0;                          /* shared variable */
pthread_mutex_t mx  = PTHREAD_MUTEX_INITIALIZER;  /* guard for var */
pthread_t       child;

/* ---- Parent ---- */          /* ---- Child ---- */

var += 1; /* no lock used */

pthread_create( &child, child_fn, NULL );

                                void child_fn ( void* uu ) {
pthread_mutex_lock(&mx);           pthread_mutex_lock(&mx);
var += 2;                          var += 3;
pthread_mutex_unlock(&mx);         pthread_mutex_unlock(&mx);
                                }

pthread_join ( child );

var += 4; /* no lock used */
				544	]]></programlisting>
				545
				546	<para>This program is correct, but using only the mechanisms described
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	547	so far, Helgrind would report an error at
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	548	<computeroutput>var += 4</computeroutput>. This is because, by that
				549	point, <computeroutput>var</computeroutput> is marked as being in the
				550	state "shared-modified and protected by the
				551	lock <computeroutput>mx</computeroutput>", but is being accessed
				552	without locking. Really, what we want is
				553	for <computeroutput>var</computeroutput> to return to the parent
				554	thread's exclusive ownership after the child thread has exited.</para>
				555
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	556	<para>To make this possible, for every memory location Helgrind also keeps
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	557	track of all the threads that have accessed that location
				558	-- its threadset. When a thread Tquitter joins back to Tstayer,
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	559	Helgrind examines the threadsets of all memory in shared-modified or
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	560	shared-readonly state. In each such threadset, if Tquitter is
				561	mentioned, it is removed and replaced by Tstayer. If, as a result, a
				562	threadset becomes a singleton set containing Tstayer, then the
				563	location's state is changed to belongs-exclusively-to-Tstayer.</para>
				564
				565	<para>In our example, the result is exactly as we desire:
				566	<computeroutput>var</computeroutput> is reacquired exclusively by the
				567	parent after the child exits.</para>
				568
				569	<para>More generally, when a group of threads merges back to a single
				570	thread via a cascade of pthread_join calls, any memory shared by the
				571	group (or a subset of it) ends up being owned exclusively by the sole
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	572	surviving thread. This significantly enhances Helgrind's flexibility,
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	573	since it means that each memory location may make arbitrarily many
				574	transitions between exclusive and shared ownership. Furthermore, a
				575	different lock may protect the location during each period of shared
				576	ownership.</para>
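
<para>The join rule described above can be sketched in a few lines
of C. This is only an illustration under stated assumptions, not
Helgrind's actual implementation: the type and function names
(<computeroutput>Shadow</computeroutput>,
<computeroutput>on_join</computeroutput>) are hypothetical, and the
threadset is modelled as a 32-bit mask, so at most 32 threads are
assumed.</para>

<programlisting><![CDATA[
#include <stdint.h>

/* Hypothetical per-location shadow state; all names are illustrative. */
typedef enum { EXCLUSIVE, SHARED_READONLY, SHARED_MODIFIED } OwnState;

typedef struct {
   OwnState state;
   uint32_t threadset;  /* bit i set => thread i has accessed the location */
} Shadow;

/* Apply the join rule to one location when thread 'quitter' joins
   back to thread 'stayer' (both assumed to be in the range 0..31). */
void on_join ( Shadow* s, unsigned quitter, unsigned stayer )
{
   uint32_t qbit = UINT32_C(1) << quitter;
   uint32_t sbit = UINT32_C(1) << stayer;

   if (s->threadset & qbit) {
      /* Remove the quitter and replace it by the stayer. */
      s->threadset = (s->threadset & ~qbit) | sbit;
      /* A singleton threadset means exclusive ownership again. */
      if (s->threadset == sbit)
         s->state = EXCLUSIVE;
   }
}
]]></programlisting>

<para>With a location shared by threads 0 and 1, a call
<computeroutput>on_join(loc, 1, 0)</computeroutput> leaves the location
exclusively owned by thread 0. If a third thread is also in the
threadset, the location stays shared.</para>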
				577
				578	</sect2>
				579
				580
				581
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	582	<sect2 id="hg-manual.data-races.summary" xreflabel="Race Det Summary">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	583	<title>A Summary of the Race Detection Algorithm</title>
				584
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	585	<para>Helgrind looks for memory locations which are accessed by more
				586	than one thread. For each such location, Helgrind records which of
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	587	the program's locks were held by the accessing thread at the time of
				588	each access. The hope is to discover that there is indeed at least
				589	one lock which is consistently used by all threads to protect that
				590	location. If no such lock can be found, then there is apparently no
				591	consistent locking strategy being applied for that location, and so a
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	592	possible data race might result. Helgrind accordingly reports an
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	593	error.</para>
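
<para>The basic idea in the paragraph above amounts to intersecting,
per location, the sets of locks held at each access. The following C
fragment is a minimal sketch of that idea only. The names
<computeroutput>LocState</computeroutput> and
<computeroutput>record_access</computeroutput> are hypothetical, locks
are modelled as bits in a 32-bit mask, and the sketch deliberately
ignores which thread made each access.</para>

<programlisting><![CDATA[
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-location record; names are illustrative only. */
typedef struct {
   uint32_t candidates;  /* bit i set => lock i was held at every access */
   bool     accessed;    /* false until the first access is recorded */
} LocState;

/* Record an access made while holding the locks in 'held'.  Returns
   true if at least one lock has been held at every access so far,
   and false if no consistently held lock remains. */
bool record_access ( LocState* loc, uint32_t held )
{
   if (!loc->accessed) {
      loc->candidates = held;    /* first access: all held locks qualify */
      loc->accessed   = true;
   } else {
      loc->candidates &= held;   /* keep only locks held every time */
   }
   return loc->candidates != 0;
}
]]></programlisting>

<para>A <computeroutput>false</computeroutput> result corresponds to
the point at which the basic scheme would report an error. Note that
an unlocked first access already produces
<computeroutput>false</computeroutput> here, which is one reason the
basic scheme needs refinement.</para>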
				594
				595	<para>In practice this discipline is far too simplistic, and is
				596	unusable since it reports many races in some widely used and
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	597	known-correct programming disciplines. Helgrind's checking therefore
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	598	incorporates many refinements to this basic idea, and can be
				599	summarised as follows:</para>
				600
				601	<para>The following thread events are intercepted and monitored:</para>
				602
				603	<itemizedlist>
				604	<listitem><para>thread creation and exiting (pthread_create,
				605	pthread_join, pthread_exit)</para>
				606	</listitem>
				607	<listitem>
				608	<para>lock acquisition and release (pthread_mutex_lock,
				609	pthread_mutex_unlock, pthread_rwlock_rdlock,
				610	pthread_rwlock_wrlock,
				611	pthread_rwlock_unlock)</para>
				612	</listitem>
				613	<listitem>
				614	<para>inter-thread event notifications (pthread_cond_wait,
				615	pthread_cond_signal, pthread_cond_broadcast,
				616	sem_wait, sem_post)</para>
				617	</listitem>
				618	</itemizedlist>
				619
				620	<para>Memory allocation and deallocation events are intercepted and
				621	monitored:</para>
				622
				623	<itemizedlist>
				624	<listitem>
				625	<para>malloc/new/free/delete and variants</para>
				626	</listitem>
				627	<listitem>
				628	<para>stack allocation and deallocation</para>
				629	</listitem>
				630	</itemizedlist>
				631
				632	<para>All memory accesses are intercepted and monitored.</para>
				633
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	634	<para>By observing the above events, Helgrind can infer certain
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	635	aspects of the program's locking discipline. Programs which adhere to
				636	the following rules are considered to be acceptable:
				637	</para>
				638
				639	<itemizedlist>
				640	<listitem>
				641	<para>A thread may allocate memory, and write initial values into
				642	it, without locking. That thread is regarded as owning the memory
				643	exclusively.</para>
				644	</listitem>
				645	<listitem>
				646	<para>A thread may read and write memory which it owns exclusively,
				647	without locking.</para>
				648	</listitem>
				649	<listitem>
				650	<para>Memory which is owned exclusively by one thread may be read by
				651	that thread and others without locking. However, in this situation
				652	no thread may do unlocked writes to the memory (except for the owner
				653	thread's initializing write).</para>
				654	</listitem>
				655	<listitem>
				656	<para>Memory which is shared between multiple threads, one or more
				657	of which writes to it, must be protected by a lock which is
				658	correctly acquired and released by all threads accessing the
				659	memory.</para>
				660	</listitem>
				661	</itemizedlist>
				662
				663	<para>Any violation of this discipline will cause an error to be reported.
				664	However, two exemptions apply:</para>
				665
				666	<itemizedlist>
				667	<listitem>
				668	<para>A thread Y can acquire exclusive ownership of memory
				669	previously owned exclusively by a different thread X providing
				670	X's last access and Y's first access are separated by one of the
				671	following synchronization events:</para>
				672	<itemizedlist>
				673	<listitem><para>X creates thread Y</para></listitem>
				674	<listitem><para>X joins back to Y</para></listitem>
				675	<listitem><para>X uses a condition-variable to signal at Y, and Y is
				676	waiting for that event</para></listitem>
				677	<listitem><para>Y completes a semaphore wait as a result of X signalling
				678	on that same semaphore</para></listitem>
				679	</itemizedlist>
				680	<para>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	681	This refinement allows Helgrind to correctly track the ownership
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	682	state of inter-thread buffers used in the worker-thread and
				683	worker-thread-pool concurrent programming idioms (styles).</para>
				684	</listitem>
				685	<listitem>
				686	<para>Similarly, if thread Y joins back to thread X, memory
				687	exclusively owned by Y becomes exclusively owned by X instead.
				688	Also, memory that has been shared only by X and Y becomes
				689	exclusively owned by X. More generally, memory that has been shared
				690	by X, Y and some arbitrary other set S of threads is re-marked as
				691	shared by X and S. Hence, under the right circumstances, memory
				692	shared amongst multiple threads, all of which join into just one,
				693	can revert to the exclusive ownership state.</para>
				694	<para>
				695	In effect, each memory location may make arbitrarily many
				696	transitions between exclusive and shared ownership. Furthermore, a
				697	different lock may protect the location during each period of shared
				698	ownership. This significantly enhances the flexibility of the
				699	algorithm.</para>
				700	</listitem>
				701	</itemizedlist>
				702
				703	<para>The ownership state, accessing thread-set and related lock-set
				704	for each memory location are tracked at 8-bit granularity. This means
				705	the algorithm is precise even for 16- and 8-bit memory
				706	accesses.</para>
				707
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	708	<para>Helgrind correctly handles reader-writer locks in this
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	709	framework. Locations shared between multiple threads can be protected
				710	during reads by locks held in either read-mode or write-mode, but can
				711	only be protected during writes by locks held in write-mode. Normal
				712	POSIX mutexes are treated as if they are reader-writer locks which are
				713	only ever held in write-mode.</para>
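
<para>The reader-writer rule above can be stated compactly in code.
This is a sketch under the assumption that held locks are modelled as
two bit masks, one for locks held in read-mode and one for locks held
in write-mode. The function name
<computeroutput>protecting_locks</computeroutput> is hypothetical.</para>

<programlisting><![CDATA[
#include <stdbool.h>
#include <stdint.h>

/* Return the set of held locks that count as protecting an access.
   A read is protected by locks held in either mode; a write is
   protected only by locks held in write-mode.  A normal mutex would
   only ever appear in 'held_wr'. */
uint32_t protecting_locks ( bool is_write,
                            uint32_t held_rd, uint32_t held_wr )
{
   return is_write ? held_wr : (held_rd | held_wr);
}
]]></programlisting>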
				714
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	715	<para>Helgrind correctly handles POSIX mutexes for which recursive
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	716	locking is allowed.</para>
				717
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	718	<para>Helgrind partially correctly handles x86 and amd64 memory access
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	719	instructions preceded by a LOCK prefix. Writes are correctly handled,
				720	by pretending that the LOCK prefix implies acquisition and release of
				721	a magic "bus hardware lock" mutex before and after the instruction.
				722	This unfortunately requires subsequent reads from such locations to
				723	also use a LOCK prefix, which is not required by the real hardware.
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	724	Helgrind does not offer any equivalent handling for atomic sequences
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	725	on PowerPC/POWER platforms created by the use of lwarx/stwcx
				726	instructions.</para>
				727
				728	</sect2>
				729
				730
				731
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	732	<sect2 id="hg-manual.data-races.errmsgs" xreflabel="Race Error Messages">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	733	<title>Interpreting Race Error Messages</title>
				734
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	735	<para>Helgrind's race detection algorithm collects a lot of
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	736	information, and tries to present it in a helpful way when a race is
				737	detected. Here's an example:</para>
				738
				739	<programlisting><![CDATA[
				740	Thread #2 was created
				741	at 0x510548E: clone (in /lib64/libc-2.5.so)
				742	by 0x4E2F305: do_clone (in /lib64/libpthread-2.5.so)
				743	by 0x4E2F7C5: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.5.so)
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	744	by 0x4C23870: pthread_create@* (hg_intercepts.c:198)
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	745	by 0x400CEF: main (tc17_sembar.c:195)
				746
				747	// And the same for threads #3, #4 and #5 -- omitted for conciseness
				748
				749	Possible data race during read of size 4 at 0x602174
				750	at 0x400BE5: gomp_barrier_wait (tc17_sembar.c:122)
				751	by 0x400C44: child (tc17_sembar.c:161)
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	752	by 0x4C25DF7: mythread_wrapper (hg_intercepts.c:178)
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	753	by 0x4E2F09D: start_thread (in /lib64/libpthread-2.5.so)
				754	by 0x51054CC: clone (in /lib64/libc-2.5.so)
				755	Old state: shared-modified by threads #2, #3, #4, #5
				756	New state: shared-modified by threads #2, #3, #4, #5
				757	Reason: this thread, #2, holds no consistent locks
				758	Last consistently used lock for 0x602174 was first observed
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	759	at 0x4C25D01: pthread_mutex_init (hg_intercepts.c:326)
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	760	by 0x4009E4: gomp_barrier_init (tc17_sembar.c:46)
				761	by 0x400CBC: main (tc17_sembar.c:192)
				762	]]></programlisting>
				763
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	764	<para>Helgrind first announces the creation points of any threads
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	765	referenced in the error message. This is so it can speak concisely
				766	about threads and sets of threads without repeatedly printing their
				767	creation point call stacks. Each thread is only ever announced once,
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	768	the first time it appears in any Helgrind error message.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	769
				770	<para>The main error message begins at the text
				771	"<computeroutput>Possible data race during read</computeroutput>".
				772	At the start is information you would expect to see -- address and
				773	size of the racing access, whether a read or a write, and the call
				774	stack at the point it was detected.</para>
				775
				776	<para>More interesting is the state transition caused by this access.
				777	This memory is already in the shared-modified state, and up to now has
				778	been consistently protected by at least one lock. However, the thread
				779	making the access in question (thread #2, here) does not hold any
				780	locks in common with those held during all previous accesses to the
				781	location -- "no consistent locks", in other words.</para>
				782
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	783	<para>Finally, Helgrind shows the lock which has protected this
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	784	location in all previous accesses. (If there is more than one, only
				785	one is shown). This can be a useful hint, because it typically shows
				786	the lock that the programmers intended to use to protect the location,
				787	but in this case forgot.</para>
				788
				789	<para>Here are some more examples of race reports. This is not an
				790	exhaustive list of combinations, but should give you some insight into
				791	how to interpret the output.</para>
				792
				793	<programlisting><![CDATA[
				794	Possible data race during write ...
				795	Old state: shared-readonly by threads #1, #2, #3
				796	New state: shared-modified by threads #1, #2, #3
				797	Reason: this thread, #3, holds no consistent locks
				798	Location ... has never been protected by any lock
				799	]]></programlisting>
				800
				801	<para>The location is shared by 3 threads, all of which have been
				802	reading it without locking ("has never been protected by any lock").
				803	Now one of them is writing it. Regardless of whether the writer has a
				804	lock or not, this is still an error, because the write races against
				805	the previously observed reads.</para>
				806
				807	<programlisting><![CDATA[
				808	Possible data race during read ...
				809	Old state: shared-modified by threads #1, #2, #3
				810	New state: shared-modified by threads #1, #2, #3
				811	Reason: this thread, #3, holds no consistent locks
				812	Last consistently used lock for ... was first observed ...
				813	]]></programlisting>
				814
				815	<para>The location is shared by 3 threads, all of which have been
				816	reading and writing it while (as required) holding at least one lock
				817	in common. Now it is being read without that lock being held. In the
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	818	"Last consistently used lock" part, Helgrind offers its best guess as
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	819	to the identity of the lock that should have been used.</para>
				820
				821	<programlisting><![CDATA[
				822	Possible data race during write ...
				823	Old state: owned exclusively by thread #4
				824	New state: shared-modified by threads #4, #5
				825	Reason: this thread, #5, holds no locks at all
				826	]]></programlisting>
				827
				828	<para>A location that has so far been accessed exclusively by thread
				829	#4 has now been written by thread #5, without use of any lock. This
				830	can be a sign that the programmer did not consider the possibility of
				831	the location being shared between threads, or, alternatively, forgot
				832	to use the appropriate lock.</para>
				833
				834	<para>Note that thread #4 exclusively owns the location, and so has
				835	the right to access it without holding a lock. However, this message
				836	does not say that thread #4 is not using a lock for this location.
				837	Indeed, it could be using a lock for the location because it intends
				838	to make it available to other threads, one of which is thread #5 --
				839	and thread #5 has forgotten to use the lock.</para>
				840
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	841	<para>Also, this message implies that Helgrind did not see any
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	842	synchronisation event between threads #4 and #5 that would have
				843	allowed #5 to acquire exclusive ownership from #4. See
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	844	<link linkend="hg-manual.data-races.exclusive">above</link>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	845	for a discussion of transfers of exclusive ownership states between
				846	threads.</para>
				847
				848	</sect2>
				849
				850
				851	</sect1>
				852
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	853	<sect1 id="hg-manual.effective-use" xreflabel="Helgrind Effective Use">
				854	<title>Hints and Tips for Effective Use of Helgrind</title>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	855
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	856	<para>Helgrind can be very helpful in finding and resolving
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	857	threading-related problems. Like all sophisticated tools, it is most
				858	effective when you understand how to play to its strengths.</para>
				859
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	860	<para>Helgrind will be less effective when you merely throw an
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	861	existing threaded program at it and try to make sense of any reported
				862	errors. It will be more effective if you design threaded programs
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	863	from the start in a way that helps Helgrind verify correctness. The
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	864	same is true for finding memory errors with Memcheck, but applies more
				865	here, because thread checking is a harder problem. Consequently it is
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	866	much easier to write a correct program for which Helgrind falsely
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	867	reports (threading) errors than it is to write a correct program for
				868	which Memcheck falsely reports (memory) errors.</para>
				869
				870	<para>With that in mind, here are some tips, listed most important first,
				871	for getting reliable results and avoiding false errors. The first two
				872	are critical. Any violations of them will swamp you with huge numbers
				873	of false data-race errors.</para>
				874
				875
				876	<orderedlist>
				877
				878	<listitem>
				879	<para>Make sure your application, and all the libraries it uses,
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	880	use the POSIX threading primitives. Helgrind needs to be able to
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	881	see all events pertaining to thread creation, exit, locking and
sewardj	3387889	2007-11-17 09:43:25 +0000	[diff] [blame^]	882	other synchronisation events. To do so it intercepts many POSIX
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	883	pthread_ functions.</para>
				884
				885	<para>Do not roll your own threading primitives (mutexes, etc)
				886	from combinations of the Linux futex syscall, counters and wotnot.
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	887	These throw Helgrind's internal what's-going-on models way off
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	888	course and will give bogus results.</para>
				889
				890	<para>Also, do not reimplement existing POSIX abstractions using
				891	other POSIX abstractions. For example, don't build your own
				892	semaphore routines or reader-writer locks from POSIX mutexes and
				893	condition variables. Instead use POSIX reader-writer locks and
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	894	semaphores directly, since Helgrind supports them directly.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	895
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	896	<para>Helgrind directly supports the following POSIX threading
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	897	abstractions: mutexes, reader-writer locks, condition variables
				898	(but see below), and semaphores. Currently spinlocks and barriers
				899	are not supported, although they could be in future. A prototype
				900	"safe" implementation of barriers, based on semaphores, is
				901	available: please contact the Valgrind authors for details.</para>
				902
				903	<para>At the time of writing, the following popular Linux packages
				904	are known to implement their own threading primitives:</para>
				905
				906	<itemizedlist>
				907	<listitem><para>Qt version 4.X. Qt 3.X is fine, but not 4.X.
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	908	Helgrind contains partial direct support for Qt 4.X threading,
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	909	but this is not yet in a usable state. Assistance from folks
				910	knowledgeable in Qt 4 threading internals would be
				911	appreciated.</para></listitem>
				912
				913	<listitem><para>Runtime support library for GNU OpenMP (part of
				914	GCC), at least GCC versions 4.2 and 4.3. With some minor effort
				915	of modifying the GNU OpenMP runtime support sources, it is
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	916	possible to use Helgrind on GNU OpenMP compiled codes. Please
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	917	contact the Valgrind authors for details.</para></listitem>
				918	</itemizedlist>
				919	</listitem>
				920
				921	<listitem>
				922	<para>Avoid memory recycling. If you can't avoid it, you must
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	923	tell Helgrind what is going on via the VALGRIND_HG_CLEAN_MEMORY
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	924	client request
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	925	(in <computeroutput>helgrind.h</computeroutput>).</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	926
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	927	<para>Helgrind is aware of standard memory allocation and
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	928	deallocation that occurs via malloc/free/new/delete and from entry
				929	and exit of stack frames. In particular, when memory is
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	930	deallocated via free, delete, or function exit, Helgrind considers
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	931	that memory clean, so when it is eventually reallocated, its
				932	history is irrelevant.</para>
				933
				934	<para>However, it is common practice to implement memory recycling
				935	schemes. In these, memory to be freed is not handed to
				936	free/delete, but instead put into a pool of free buffers to be
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	937	handed out again as required. The problem is that Helgrind has no
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	938	way to know that such memory is logically no longer in use, and
				939	its history is irrelevant. Hence you must make that explicit,
				940	using the VALGRIND_HG_CLEAN_MEMORY client request to specify the
				941	relevant address ranges. It's easiest to put these requests into
				942	the pool manager code, and use them either when memory is returned
				943	to the pool, or is allocated from it.</para>
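
<para>As an illustration, here is a minimal sketch of a buffer pool
that issues the request when a buffer is handed out. The pool itself
(<computeroutput>Buf</computeroutput>,
<computeroutput>pool_put</computeroutput>,
<computeroutput>pool_get</computeroutput>,
<computeroutput>BUF_SIZE</computeroutput>) is hypothetical. The header
path <computeroutput>valgrind/helgrind.h</computeroutput> is an
assumption about where the header is installed, and the no-op fallback
macro exists only so that the sketch builds without it.</para>

<programlisting><![CDATA[
#include <stddef.h>
#include <pthread.h>

#if defined(__has_include)
#  if __has_include(<valgrind/helgrind.h>)
#    include <valgrind/helgrind.h>
#  endif
#endif
#ifndef VALGRIND_HG_CLEAN_MEMORY
/* Fallback so the sketch builds without the Valgrind headers. */
#  define VALGRIND_HG_CLEAN_MEMORY(start, len) ((void)(start), (void)(len))
#endif

#define BUF_SIZE 256   /* hypothetical fixed buffer size */

/* A free buffer stores the free-list link in its own storage. */
typedef union Buf { union Buf* next; char data[BUF_SIZE]; } Buf;

static Buf*            free_list = NULL;
static pthread_mutex_t pool_mx   = PTHREAD_MUTEX_INITIALIZER;

/* Return a buffer to the pool instead of freeing it. */
void pool_put ( void* p )
{
   Buf* b = p;
   pthread_mutex_lock(&pool_mx);
   b->next   = free_list;
   free_list = b;
   pthread_mutex_unlock(&pool_mx);
}

/* Hand out a recycled buffer, or NULL if the pool is empty. */
void* pool_get ( void )
{
   Buf* b;
   pthread_mutex_lock(&pool_mx);
   b = free_list;
   if (b != NULL)
      free_list = b->next;
   pthread_mutex_unlock(&pool_mx);
   if (b != NULL) {
      /* The buffer's earlier history is irrelevant to its new user. */
      VALGRIND_HG_CLEAN_MEMORY(b, sizeof *b);
   }
   return b;
}
]]></programlisting>

<para>Issuing the request in <computeroutput>pool_get</computeroutput>
rather than in <computeroutput>pool_put</computeroutput> is a design
choice for this sketch: the pool writes its own link field into the
buffer while it sits on the free list, so the cleaning is done after
the buffer has been unlinked.</para>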
				944	</listitem>
				945
				946	<listitem>
				947	<para>Avoid POSIX condition variables. If you can, use POSIX
				948	semaphores (sem_t, sem_post, sem_wait) to do inter-thread event
				949	signalling. Semaphores with an initial value of zero are
				950	particularly useful for this.</para>
				951
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	952	<para>Helgrind only partially correctly handles POSIX condition
				953	variables. This is because Helgrind can see inter-thread
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	954	dependencies between a pthread_cond_wait call and a
				955	pthread_cond_signal/broadcast call only if the waiting thread
				956	actually gets to the rendezvous first (so that it actually calls
				957	pthread_cond_wait). It can't see dependencies between the threads
				958	if the signaller arrives first. In the latter case, POSIX
				959	guidelines imply that the associated boolean condition still
				960	provides an inter-thread synchronisation event, but one which is
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	961	invisible to Helgrind.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	962
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	963	<para>The result of Helgrind missing some inter-thread
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	964	synchronisation events is to cause it to report false positives.
				965	That's because missing such events reduces the extent to which it
				966	can transfer exclusive memory ownership between threads. So
				967	memory may end up in a shared-modified state when that was not
				968	intended by the application programmers.</para>
				969
				970	<para>The root cause of this synchronisation lossage is
				971	particularly hard to understand, so an example is helpful. It was
				972	discussed at length by Arndt Muehlenfeld ("Runtime Race Detection
				973	in Multi-Threaded Programs", Dissertation, TU Graz, Austria). The
				974	canonical POSIX-recommended usage scheme for condition variables
				975	is as follows:</para>
				976
				977	<programlisting><![CDATA[
				978	b is a Boolean condition, which is False most of the time
				979	cv is a condition variable
				980	mx is its associated mutex
				981
				982	Signaller:                             Waiter:
				983
				984	lock(mx)                               lock(mx)
				985	b = True                               while (b == False)
				986	signal(cv)                                wait(cv,mx)
				987	unlock(mx)                             unlock(mx)
				988	]]></programlisting>
				989
				990	<para>Assume <computeroutput>b</computeroutput> is False most of
				991	the time. If the waiter arrives at the rendezvous first, it
				992	enters its while-loop, waits for the signaller to signal, and
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	993	eventually proceeds. Helgrind sees the signal, notes the
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	994	dependency, and all is well.</para>
				995
				996	<para>If the signaller arrives
				997	first, <computeroutput>b</computeroutput> is set to true, and the
				998	signal disappears into nowhere. When the waiter later arrives, it
				999	does not enter its while-loop and simply carries on. But even in
				1000	this case, the waiter code following the while-loop cannot execute
				1001	until the signaller sets <computeroutput>b</computeroutput> to
				1002	True. Hence there is still the same inter-thread dependency, but
				1003	this time it is through an arbitrary in-memory condition, and
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1004	Helgrind cannot see it.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1005
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1006	<para>By comparison, Helgrind's detection of inter-thread
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1007	dependencies caused by semaphore operations is believed to be
				1008	exactly correct.</para>
				1009
				1010	<para>As far as I know, a solution to this problem that does not
				1011	require source-level annotation of condition-variable wait loops
				1012	is beyond the current state of the art.</para>
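
<para>The advice at the start of this item, to signal events with
semaphores, can be sketched as follows. This is an illustrative
wrapper only: the names <computeroutput>Event</computeroutput>,
<computeroutput>event_init</computeroutput>,
<computeroutput>event_signal</computeroutput> and
<computeroutput>event_wait</computeroutput> are hypothetical, and only
the standard <computeroutput>sem_init</computeroutput>,
<computeroutput>sem_post</computeroutput> and
<computeroutput>sem_wait</computeroutput> calls are assumed.</para>

<programlisting><![CDATA[
#include <errno.h>
#include <semaphore.h>

/* A one-shot event carrying an int from signaller to waiter. */
typedef struct { sem_t sem; int payload; } Event;

/* Initialise with a count of zero, so a wait blocks until a post. */
int event_init ( Event* e )
{
   return sem_init(&e->sem, 0, 0);
}

/* Signaller: publish the value, then post the semaphore. */
int event_signal ( Event* e, int value )
{
   e->payload = value;
   return sem_post(&e->sem);
}

/* Waiter: block until the event has been posted, then read the value. */
int event_wait ( Event* e, int* value )
{
   while (sem_wait(&e->sem) != 0) {
      if (errno != EINTR)
         return -1;    /* a real error, not an interrupted wait */
   }
   *value = e->payload;
   return 0;
}
]]></programlisting>

<para>Unlike the condition-variable scheme, there is no separate
in-memory flag here. If the signaller arrives first, the post is
remembered in the semaphore's count, and the waiter still passes
through <computeroutput>sem_wait</computeroutput>.</para>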
				1013	</listitem>
				1014
				1015	<listitem>
				1016	<para>Make sure you are using a supported Linux distribution. At
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1017	present, Helgrind only properly supports x86-linux and amd64-linux
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1018	with glibc-2.3 or later. The latter restriction means we only
				1019	support glibc's NPTL threading implementation. The old
				1020	LinuxThreads implementation is not supported.</para>
				1021
				1022	<para>Unsupported targets may work to varying degrees. In
				1023	particular ppc32-linux and ppc64-linux running NPTL should work,
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1024	but you will get false race errors because Helgrind does not know
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1025	how to properly handle atomic instruction sequences created using
				1026	the lwarx/stwcx instructions.</para>
				1027	</listitem>
				1028
				1029	<listitem>
				1030	<para>Round up all finished threads using pthread_join. Avoid
				1031	detaching threads: don't create threads in the detached state, and
				1032	don't call pthread_detach on existing threads.</para>
				1033
				1034	<para>Using pthread_join to round up finished threads provides a
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1035	clear synchronisation point that both Helgrind and programmers can
				1036	see. This synchronisation point allows Helgrind to adjust its
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1037	memory ownership
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1038	models <link linkend="hg-manual.data-races.exclusive">as described
				1039	extensively above</link>, which helps Helgrind produce more
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1040	accurate error reports.</para>
				1041
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1042	<para>If you don't call pthread_join on a thread, Helgrind has no
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1043	way to know when it finishes, relative to any significant
				1044	synchronisation points for other threads in the program. So it
				1045	assumes that the thread lingers indefinitely and can potentially
				1046	interfere indefinitely with the memory state of the program. It
				1047	has every right to assume that -- after all, it might really be
				1048	the case that, for scheduling reasons, the exiting thread did run
				1049	very slowly in the last stages of its life.</para>
				1050	</listitem>
				1051
				1052	<listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1053	<para>Perform thread debugging (with Helgrind) and memory
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1054	debugging (with Memcheck) together.</para>
				1055
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1056	<para>Helgrind tracks the state of memory in detail, and memory
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1057	management bugs in the application are liable to cause confusion.
				1058	In extreme cases, applications which do many invalid reads and
				1059	writes (particularly to freed memory) have been known to crash
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1060	Helgrind. So, ideally, you should make your application
				1061	Memcheck-clean before using Helgrind.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1062
				1063	<para>It may be impossible to make your application Memcheck-clean
				1064	unless you first remove threading bugs. In particular, it may be
				1065	difficult to remove all reads and writes to freed memory in
				1066	multithreaded C++ destructor sequences at program termination.
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1067	So, ideally, you should make your application Helgrind-clean
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1068	before using Memcheck.</para>
				1069
				1070	<para>Since this circularity is obviously unresolvable, at least
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1071	bear in mind that Memcheck and Helgrind are to some extent
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1072	complementary, and you may need to use them together.</para>
				1073	</listitem>
				1074
				1075	<listitem>
				1076	<para>POSIX requires that implementations of standard I/O (printf,
				1077	fprintf, fwrite, fread, etc) are thread safe. Unfortunately GNU
				1078	libc implements this by using internal locking primitives that
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1079	Helgrind is unable to intercept. Consequently Helgrind generates
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1080	many false race reports when you use these functions.</para>
				1081
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1082	<para>Helgrind attempts to hide these errors using the standard
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1083	Valgrind error-suppression mechanism. So, at least for simple
				1084	test cases, you don't see any. Nevertheless, some may slip
				1085	through, so keep this in mind when reading race reports that involve these functions.</para>
				1086	</listitem>
				1087
				1088	<listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1089	<para>Helgrind's error checks do not work properly inside the
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1090	system threading library itself
				1091	(<computeroutput>libpthread.so</computeroutput>), and it usually
				1092	observes large numbers of (false) errors in there. Valgrind's
				1093	suppression system then filters these out, so you should not see
				1094	them.</para>
				1095
				1096	<para>If you see any race errors reported
				1097	where <computeroutput>libpthread.so</computeroutput> or
				1098	<computeroutput>ld.so</computeroutput> is the object associated
				1099	with the innermost stack frame, please file a bug report at
				1100	http://www.valgrind.org.</para>
				1101	</listitem>
				1102
				1103	</orderedlist>
				1104
				1105	</sect1>
				1106
				1107
				1108
				1109
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1110	<sect1 id="hg-manual.options" xreflabel="Helgrind Options">
				1111	<title>Helgrind Options</title>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1112
				1113	<para>The following end-user options are available:</para>
				1114
				1115	<!-- start of xi:include in the manpage -->
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1116	<variablelist id="hg.opts.list">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1117
				1118	<varlistentry id="opt.happens-before" xreflabel="--happens-before">
				1119	<term>
				1120	<option><![CDATA[--happens-before=none|threads|all
				1121	[default: all] ]]></option>
				1122	</term>
				1123	<listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1124	<para>Helgrind always regards locks as the basis for
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1125	inter-thread synchronisation. However, by default, before
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1126	reporting a race error, Helgrind will also check whether
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1127	certain other kinds of inter-thread synchronisation events
				1128	happened. It may be that if such events took place, then no
				1129	race really occurred, and so no error needs to be reported.
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1130	See <link linkend="hg-manual.data-races.exclusive">above</link>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1131	for a discussion of transfers of exclusive ownership states
				1132	between threads.
				1133	</para>
				1134	<para>With <varname>--happens-before=all</varname>, the
				1135	following events are regarded as sources of synchronisation:
				1136	thread creation/joinage, condition variable
				1137	signal/broadcast/waits, and semaphore posts/waits.
				1138	</para>
				1139	<para>With <varname>--happens-before=threads</varname>, only
				1140	thread creation/joinage events are regarded as sources of
				1141	synchronisation.
				1142	</para>
				1143	<para>With <varname>--happens-before=none</varname>, no events
				1144	(apart, of course, from locking) are regarded as sources of
				1145	synchronisation.
				1146	</para>
				1147	<para>Changing this setting from the default will increase your
				1148	false-error rate but give little or no gain. The only advantage
				1149	is that <option>--happens-before=threads</option> and
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1150	<option>--happens-before=none</option> should make Helgrind
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1151	less and less sensitive to the scheduling of threads, and hence
				1152	the output more and more repeatable across runs.
				1153	</para>
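<para>To illustrate the kind of event involved, here is a hedged sketch
of a hand-off ordered only by a semaphore. With
<option>--happens-before=all</option> the post/wait pair is among the
events listed above as sources of synchronisation; with the other two
settings it is not, so accesses like these may be reported. The names
<computeroutput>message</computeroutput>,
<computeroutput>ready</computeroutput>,
<computeroutput>producer</computeroutput> and
<computeroutput>receive_via_semaphore</computeroutput> are hypothetical.</para>
<programlisting><![CDATA[
```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

static int message = 0;  /* hypothetical shared datum; no lock protects it */
static sem_t ready;      /* posted by the producer once message is written */

/* Producer thread body: writes the datum, then posts the semaphore. */
static void *producer(void *arg)
{
    (void)arg;
    message = 7;
    sem_post(&ready);
    return NULL;
}

/* Starts the producer, waits on the semaphore, then reads message.
   Returns the value read, or -1 if setup failed. */
int receive_via_semaphore(void)
{
    pthread_t tid;
    int seen;

    message = 0;
    if (sem_init(&ready, 0, 0) != 0)
        return -1;
    if (pthread_create(&tid, NULL, producer, NULL) != 0) {
        sem_destroy(&ready);
        return -1;
    }

    /* Wait for the post, retrying if interrupted by a signal. */
    while (sem_wait(&ready) == -1 && errno == EINTR)
        continue;

    /* This read is ordered after the producer's write only through the
       semaphore; no mutex is held on either side. */
    seen = message;

    pthread_join(tid, NULL);
    sem_destroy(&ready);
    return seen;
}
```
]]></programlisting>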
				1154	</listitem>
				1155	</varlistentry>
				1156
				1157	<varlistentry id="opt.trace-addr" xreflabel="--trace-addr">
				1158	<term>
				1159	<option><![CDATA[--trace-addr=0xXXYYZZ
				1160	]]></option> and
				1161	<option><![CDATA[--trace-level=0|1|2 [default: 1]
				1162	]]></option>
				1163	</term>
				1164	<listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1165	<para>Requests that Helgrind produces a log of all state changes
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1166	to location 0xXXYYZZ. This can be helpful in tracking down
				1167	tricky races. <varname>--trace-level</varname> controls the
				1168	verbosity of the log. At the default setting (1), a one-line
				1169	summary is printed for each state change. At level 2 a
				1170	complete stack trace is printed for each state change.</para>
				1171	</listitem>
				1172	</varlistentry>
				1173
				1174	</variablelist>
				1175	<!-- end of xi:include in the manpage -->
				1176
				1177	<!-- start of xi:include in the manpage -->
				1178	<para>In addition, the following debugging options are available for
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1179	Helgrind:</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1180
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1181	<variablelist id="hg.debugopts.list">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1182
				1183	<varlistentry id="opt.trace-malloc" xreflabel="--trace-malloc">
				1184	<term>
				1185	<option><![CDATA[--trace-malloc=no|yes [no]
				1186	]]></option>
				1187	</term>
				1188	<listitem>
				1189	<para>Show all client malloc (etc) and free (etc) requests.</para>
				1190	</listitem>
				1191	</varlistentry>
				1192
				1193	<varlistentry id="opt.gen-vcg" xreflabel="--gen-vcg">
				1194	<term>
				1195	<option><![CDATA[--gen-vcg=no|yes|yes-w-vts [no]
				1196	]]></option>
				1197	</term>
				1198	<listitem>
				1199	<para>At exit, write to stderr a dump of the happens-before
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1200	graph computed by Helgrind, in a format suitable for the VCG
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1201	graph visualisation tool. A suitable command line is:</para>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1202	<para><computeroutput>valgrind --tool=helgrind
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1203	--gen-vcg=yes my_app 2>&amp;1
				1204	| grep xxxxxx | sed "s/xxxxxx//g"
				1205	| xvcg -</computeroutput></para>
				1206	<para>With <varname>--gen-vcg=yes</varname>, the basic
				1207	happens-before graph is shown. With
				1208	<varname>--gen-vcg=yes-w-vts</varname>, the vector timestamp
				1209	for each node is also shown.</para>
				1210	</listitem>
				1211	</varlistentry>
				1212
				1213	<varlistentry id="opt.cmp-race-err-addrs"
				1214	xreflabel="--cmp-race-err-addrs">
				1215	<term>
				1216	<option><![CDATA[--cmp-race-err-addrs=no|yes [no]
				1217	]]></option>
				1218	</term>
				1219	<listitem>
				1220	<para>Controls whether or not race (data) addresses should be
				1221	taken into account when removing duplicates of race errors.
				1222	With <varname>--cmp-race-err-addrs=no</varname>, two otherwise
				1223	identical race errors will be considered to be the same even if
				1224	their race addresses differ. With
				1225	<varname>--cmp-race-err-addrs=yes</varname> they will be
				1226	considered different. This is provided to help make certain
				1227	regression tests work reliably.</para>
				1228	</listitem>
				1229	</varlistentry>
				1230
				1231	<varlistentry id="opt.tc-sanity-flags" xreflabel="--tc-sanity-flags">
				1232	<term>
				1233	<option><![CDATA[--tc-sanity-flags=<XXXXX> (X = 0|1) [00000]
				1234	]]></option>
				1235	</term>
				1236	<listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1237	<para>Run extensive sanity checks on Helgrind's internal
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1238	data structures at events defined by the bitstring, as
				1239	follows:</para>
				1240	<para><computeroutput>10000 </computeroutput>after changes to
				1241	the lock order acquisition graph</para>
				1242	<para><computeroutput>01000 </computeroutput>after every client
				1243	memory access (NB: not currently used)</para>
				1244	<para><computeroutput>00100 </computeroutput>after every client
				1245	memory range permission setting of 256 bytes or greater</para>
				1246	<para><computeroutput>00010 </computeroutput>after every client
				1247	lock or unlock event</para>
				1248	<para><computeroutput>00001 </computeroutput>after every client
				1249	thread creation or joinage event</para>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1250	<para>Note these will make Helgrind run very slowly, often to
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1251	the point of being completely unusable.</para>
				1252	</listitem>
				1253	</varlistentry>
				1254
				1255	</variablelist>
				1256	<!-- end of xi:include in the manpage -->
				1257
				1258
				1259	</sect1>
				1260
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1261	<sect1 id="hg-manual.todolist" xreflabel="To Do List">
				1262	<title>A To-Do List for Helgrind</title>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1263
				1264	<para>The following is a list of loose ends which should be tidied up
				1265	some time.</para>
				1266
				1267	<itemizedlist>
				1268	<listitem><para>Track which mutexes are associated with which
				1269	condition variables, and emit a warning if this becomes
				1270	inconsistent.</para>
				1271	</listitem>
				1272	<listitem><para>For lock order errors, print the complete lock
				1273	cycle, rather than only doing so for size-2 cycles as at
				1274	present.</para>
				1275	</listitem>
				1276	<listitem><para>Document the VALGRIND_HG_CLEAN_MEMORY client
				1277	request.</para>
				1278	</listitem>
				1279	<listitem><para>Possibly a client request to forcibly transfer
				1280	ownership of memory from one thread to another. Requires further
				1281	consideration.</para>
				1282	</listitem>
				1283	<listitem><para>Add a new client request that marks an address range
				1284	as being "shared-modified with empty lockset" (the error state),
				1285	and describe how to use it.</para>
				1286	</listitem>
				1287	<listitem><para>Document races caused by gcc's thread-unsafe code
				1288	generation for speculative stores. In the interim see
				1289	<computeroutput>http://gcc.gnu.org/ml/gcc/2007-10/msg00266.html
				1290	</computeroutput>
				1291	and <computeroutput>http://lkml.org/lkml/2007/10/24/673</computeroutput>.
				1292	</para>
				1293	</listitem>
				1294	<listitem><para>Don't update the lock-order graph, and don't check
				1295	for errors, when a "try"-style lock operation happens (eg
				1296	pthread_mutex_trylock). Such calls do not add any real
				1297	restrictions to the locking order, since they can always fail to
				1298	acquire the lock, resulting in the caller going off and doing Plan
				1299	B (presumably it will have a Plan B). Doing such checks could
				1300	generate false lock-order errors and confuse users.</para>
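<para>A minimal sketch of such a "try"-style caller follows. The names
<computeroutput>stats_lock</computeroutput>,
<computeroutput>stats_total</computeroutput>,
<computeroutput>try_add</computeroutput> and
<computeroutput>read_total</computeroutput> are hypothetical; the point
is only that the caller never blocks on the lock and so cannot be the
blocked party in a lock-order deadlock.</para>
<programlisting><![CDATA[
```c
#include <pthread.h>

static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;
static long stats_total = 0;  /* protected by stats_lock */

/* Tries to add delta to the shared total without blocking.
   Returns 1 if the update was applied, or 0 if the lock could not be
   taken and the caller should fall back to its Plan B. */
int try_add(long delta)
{
    /* pthread_mutex_trylock returns non-zero (typically EBUSY) instead
       of waiting when the mutex is already held. */
    if (pthread_mutex_trylock(&stats_lock) != 0)
        return 0;

    stats_total += delta;
    pthread_mutex_unlock(&stats_lock);
    return 1;
}

/* Reads the total under the lock, using an ordinary blocking acquire. */
long read_total(void)
{
    long value;

    pthread_mutex_lock(&stats_lock);
    value = stats_total;
    pthread_mutex_unlock(&stats_lock);
    return value;
}
```
]]></programlisting>
<para>A caller might, for example, queue the update locally and retry
later when <computeroutput>try_add</computeroutput> returns 0.</para>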
				1301	</listitem>
				1302	<listitem><para>Performance can be very poor. Slowdowns on the
				1303	order of 100:1 are not unusual. There is quite some scope for
				1304	performance improvements, though.
				1305	</para>
				1306	</listitem>
				1307
				1308	</itemizedlist>
				1309
				1310	</sect1>
				1311
				1312	</chapter>