<?xml version="1.0"?> <!-- -*- sgml -*- -->
				2	<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
				3	"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
				4
				5
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	6	<chapter id="hg-manual" xreflabel="Helgrind: thread error detector">
				7	<title>Helgrind: a thread error detector</title>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	8
				9	<para>To use this tool, you must specify
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	10	<computeroutput>--tool=helgrind</computeroutput> on the Valgrind
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	11	command line.</para>
				12
				13
				14
				15
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	16	<sect1 id="hg-manual.overview" xreflabel="Overview">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	17	<title>Overview</title>
				18
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	19	<para>Helgrind is a Valgrind tool for detecting synchronisation errors
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	20	in C, C++ and Fortran programs that use the POSIX pthreads
				21	threading primitives.</para>
				22
<para>The main abstractions in POSIX pthreads are: a set of threads
sharing a common address space, thread creation, thread joining,
thread exit, mutexes (locks), condition variables (inter-thread event
notifications), reader-writer locks, and semaphores.</para>
				27
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	28	<para>Helgrind is aware of all these abstractions and tracks their
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	29	effects as accurately as it can. Currently it does not correctly
				30	handle pthread barriers and pthread spinlocks, although it will not
				31	object if you use them. On x86 and amd64 platforms, it understands
				32	and partially handles implicit locking arising from the use of the
				33	LOCK instruction prefix.
				34	</para>
				35
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	36	<para>Helgrind can detect three classes of errors, which are discussed
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	37	in detail in the next three sections:</para>
				38
				39	<orderedlist>
				40	<listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	41	<para><link linkend="hg-manual.api-checks">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	42	Misuses of the POSIX pthreads API.</link></para>
				43	</listitem>
				44	<listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	45	<para><link linkend="hg-manual.lock-orders">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	46	Potential deadlocks arising from lock
				47	ordering problems.</link></para>
				48	</listitem>
				49	<listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	50	<para><link linkend="hg-manual.data-races">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	51	Data races -- accessing memory without adequate locking.
				52	</link></para>
				53	</listitem>
				54	</orderedlist>
				55
				56	<para>Following those is a section containing
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	57	<link linkend="hg-manual.effective-use">
				58	hints and tips on how to get the best out of Helgrind.</link>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	59	</para>
				60
				61	<para>Then there is a
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	62	<link linkend="hg-manual.options">summary of command-line
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	63	options.</link>
				64	</para>
				65
				66	<para>Finally, there is
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	67	<link linkend="hg-manual.todolist">a brief summary of areas in which Helgrind
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	68	could be improved.</link>
				69	</para>
				70
				71	</sect1>
				72
				73
				74
				75
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	76	<sect1 id="hg-manual.api-checks" xreflabel="API Checks">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	77	<title>Detected errors: Misuses of the POSIX pthreads API</title>
				78
<para>Helgrind intercepts calls to many POSIX pthreads functions, and
is therefore able to report on various common problems. Although
these are unglamorous errors, their presence can lead to undefined
program behaviour and hard-to-find bugs later in execution. The
detected errors are:</para>
				84
				85	<itemizedlist>
				86	<listitem><para>unlocking an invalid mutex</para></listitem>
				87	<listitem><para>unlocking a not-locked mutex</para></listitem>
				88	<listitem><para>unlocking a mutex held by a different
				89	thread</para></listitem>
				90	<listitem><para>destroying an invalid or a locked mutex</para></listitem>
				91	<listitem><para>recursively locking a non-recursive mutex</para></listitem>
				92	<listitem><para>deallocation of memory that contains a
				93	locked mutex</para></listitem>
				94	<listitem><para>passing mutex arguments to functions expecting
				95	reader-writer lock arguments, and vice
				96	versa</para></listitem>
				97	<listitem><para>when a POSIX pthread function fails with an
				98	error code that must be handled</para></listitem>
<listitem><para>when a thread exits whilst still holding
locks</para></listitem>
				101	<listitem><para>calling <computeroutput>pthread_cond_wait</computeroutput>
				102	with a not-locked mutex, or one locked by a different
				103	thread</para></listitem>
				104	</itemizedlist>
				105
				106	<para>Checks pertaining to the validity of mutexes are generally also
				107	performed for reader-writer locks.</para>
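
<para>As a small illustration of the second item in the list above,
the sketch below unlocks a mutex that was never locked. It is an
illustrative program, not one of Helgrind's regression tests, and the
name <computeroutput>unlock_unlocked_mutex</computeroutput> is
hypothetical. To keep the behaviour well-defined when run outside
Helgrind, it assumes an error-checking mutex
(<computeroutput>PTHREAD_MUTEX_ERRORCHECK</computeroutput>), for which
POSIX requires the bad unlock to fail with an error code.</para>

```c
#define _XOPEN_SOURCE 700   /* for PTHREAD_MUTEX_ERRORCHECK under strict modes */
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: unlocks a mutex that is not locked and returns
   the error code from pthread_mutex_unlock.  An error-checking mutex
   is assumed so that the misuse is reported rather than undefined. */
int unlock_unlocked_mutex ( void )
{
   pthread_mutexattr_t attr;
   pthread_mutex_t     mx;
   int                 r;

   r = pthread_mutexattr_init(&attr);
   assert(r == 0);
   r = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
   assert(r == 0);
   r = pthread_mutex_init(&mx, &attr);
   assert(r == 0);

   /* The misuse: mx has never been locked. */
   r = pthread_mutex_unlock(&mx);

   pthread_mutex_destroy(&mx);
   pthread_mutexattr_destroy(&attr);
   return r;
}
```

<para>With a default (non-error-checking) mutex the same unlock would
be undefined behaviour, which is why a tool-level check of this kind
is useful.</para>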
				108
				109	<para>Various kinds of this-can't-possibly-happen events are also
				110	reported. These usually indicate bugs in the system threading
				111	library.</para>
				112
				113	<para>Reported errors always contain a primary stack trace indicating
				114	where the error was detected. They may also contain auxiliary stack
				115	traces giving additional information. In particular, most errors
				116	relating to mutexes will also tell you where that mutex first came to
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	117	Helgrind's attention (the "<computeroutput>was first observed
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	118	at</computeroutput>" part), so you have a chance of figuring out which
				119	mutex it is referring to. For example:</para>
				120
<programlisting><![CDATA[
Thread #1 unlocked a not-locked lock at 0x7FEFFFA90
   at 0x4C2408D: pthread_mutex_unlock (hg_intercepts.c:492)
   by 0x40073A: nearly_main (tc09_bad_unlock.c:27)
   by 0x40079B: main (tc09_bad_unlock.c:50)
  Lock at 0x7FEFFFA90 was first observed
   at 0x4C25D01: pthread_mutex_init (hg_intercepts.c:326)
   by 0x40071F: nearly_main (tc09_bad_unlock.c:23)
   by 0x40079B: main (tc09_bad_unlock.c:50)
]]></programlisting>
				131
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	132	<para>Helgrind has a way of summarising thread identities, as
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	133	evidenced here by the text "<computeroutput>Thread
				134	#1</computeroutput>". This is so that it can speak about threads and
				135	sets of threads without overwhelming you with details. See
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	136	<link linkend="hg-manual.data-races.errmsgs">below</link>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	137	for more information on interpreting error messages.</para>
				138
				139	</sect1>
				140
				141
				142
				143
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	144	<sect1 id="hg-manual.lock-orders" xreflabel="Lock Orders">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	145	<title>Detected errors: Inconsistent Lock Orderings</title>
				146
				147	<para>In this section, and in general, to "acquire" a lock simply
				148	means to lock that lock, and to "release" a lock means to unlock
				149	it.</para>
				150
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	151	<para>Helgrind monitors the order in which threads acquire locks.
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	152	This allows it to detect potential deadlocks which could arise from
				153	the formation of cycles of locks. Detecting such inconsistencies is
				154	useful because, whilst actual deadlocks are fairly obvious, potential
				155	deadlocks may never be discovered during testing and could later lead
				156	to hard-to-diagnose in-service failures.</para>
				157
				158	<para>The simplest example of such a problem is as
				159	follows.</para>
				160
				161	<itemizedlist>
				162	<listitem><para>Imagine some shared resource R, which, for whatever
				163	reason, is guarded by two locks, L1 and L2, which must both be held
				164	when R is accessed.</para>
				165	</listitem>
				166	<listitem><para>Suppose a thread acquires L1, then L2, and proceeds
				167	to access R. The implication of this is that all threads in the
				168	program must acquire the two locks in the order first L1 then L2.
				169	Not doing so risks deadlock.</para>
				170	</listitem>
				171	<listitem><para>The deadlock could happen if two threads -- call them
				172	T1 and T2 -- both want to access R. Suppose T1 acquires L1 first,
				173	and T2 acquires L2 first. Then T1 tries to acquire L2, and T2 tries
				174	to acquire L1, but those locks are both already held. So T1 and T2
				175	become deadlocked.</para>
				176	</listitem>
				177	</itemizedlist>
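
<para>The inconsistent ordering described above can be reproduced
without risking an actual deadlock. The sketch below is an
illustrative single-threaded program; the names
<computeroutput>l1</computeroutput>, <computeroutput>l2</computeroutput>
and <computeroutput>acquire_in_both_orders</computeroutput> are
hypothetical. It takes the two locks first in one order and then in
the other. It cannot hang, since only one thread is involved, but it
exhibits the kind of ordering inconsistency this section is about.</para>

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t l1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t l2 = PTHREAD_MUTEX_INITIALIZER;
static int r_shared = 0;   /* stands in for the resource R */

static void lock ( pthread_mutex_t* m )
{
   int r = pthread_mutex_lock(m);
   assert(r == 0);
   (void)r;
}

static void unlock ( pthread_mutex_t* m )
{
   int r = pthread_mutex_unlock(m);
   assert(r == 0);
   (void)r;
}

/* Accesses R twice, using two different lock orders, and returns the
   number of accesses made. */
int acquire_in_both_orders ( void )
{
   /* First access: establishes the order "l1 before l2". */
   lock(&l1);
   lock(&l2);
   r_shared++;
   unlock(&l2);
   unlock(&l1);

   /* Second access: acquires l2 before l1, contradicting that order. */
   lock(&l2);
   lock(&l1);
   r_shared++;
   unlock(&l1);
   unlock(&l2);

   return r_shared;
}
```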
				178
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	179	<para>Helgrind builds a directed graph indicating the order in which
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	180	locks have been acquired in the past. When a thread acquires a new
				181	lock, the graph is updated, and then checked to see if it now contains
				182	a cycle. The presence of a cycle indicates a potential deadlock involving
				183	the locks in the cycle.</para>
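
<para>The graph-based check can be sketched as follows. This is a
simplified illustration, not Helgrind's actual implementation: it
assumes at most <computeroutput>MAX_LOCKS</computeroutput> locks
identified by small integers, stores edges in an adjacency matrix, and
uses a depth-first search. All names
(<computeroutput>LockGraph</computeroutput>,
<computeroutput>lg_acquire</computeroutput> and so on) are
hypothetical.</para>

```c
#include <assert.h>
#include <string.h>

#define MAX_LOCKS 16

typedef struct {
   /* edge[a][b] != 0 means: some thread acquired b while holding a */
   unsigned char edge[MAX_LOCKS][MAX_LOCKS];
} LockGraph;

void lg_init ( LockGraph* g ) { memset(g, 0, sizeof *g); }

/* Is 'to' reachable from 'from' by following recorded edges? */
static int lg_reachable ( const LockGraph* g, int from, int to,
                          unsigned char* visited )
{
   int next;
   if (from == to) return 1;
   visited[from] = 1;
   for (next = 0; next < MAX_LOCKS; next++) {
      if (g->edge[from][next] && !visited[next]
          && lg_reachable(g, next, to, visited))
         return 1;
   }
   return 0;
}

/* Record that a thread holding the locks held[0 .. n_held-1] now
   acquires new_lock.  Returns 1 if this creates a cycle in the graph
   (a potential deadlock), and 0 otherwise. */
int lg_acquire ( LockGraph* g, const int* held, int n_held, int new_lock )
{
   int i, cycle = 0;
   for (i = 0; i < n_held; i++) {
      unsigned char visited[MAX_LOCKS] = { 0 };
      /* If a path new_lock -> ... -> held[i] already exists, adding
         the edge held[i] -> new_lock closes a cycle. */
      if (lg_reachable(g, new_lock, held[i], visited))
         cycle = 1;
      g->edge[held[i]][new_lock] = 1;
   }
   return cycle;
}
```

<para>In this sketch, acquiring lock 2 while holding lock 1 records
the edge 1 to 2; a later acquisition of lock 1 while holding lock 2
then finds the existing path from 1 to 2 and reports a cycle.</para>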
				184
				185	<para>In simple situations, where the cycle only contains two locks,
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	186	Helgrind will show where the required order was established:</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	187
<programlisting><![CDATA[
Thread #1: lock order "0x7FEFFFAB0 before 0x7FEFFFA80" violated
   at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388)
   by 0x40081F: main (tc13_laog1.c:24)
  Required order was established by acquisition of lock at 0x7FEFFFAB0
   at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388)
   by 0x400748: main (tc13_laog1.c:17)
  followed by a later acquisition of lock at 0x7FEFFFA80
   at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388)
   by 0x400773: main (tc13_laog1.c:18)
]]></programlisting>
				199
<para>When there are more than two locks in the cycle, the error is
equally serious. However, at present Helgrind does not show the locks
involved, so as to avoid flooding you with information. That could be
fixed in future. Here is an example involving a cycle
of five locks from a naive implementation of the famous Dining
Philosophers problem
(see <computeroutput>helgrind/tests/tc14_laog_dinphils.c</computeroutput>).
In this case Helgrind has detected that all 5 philosophers could
simultaneously pick up their left fork and then deadlock whilst
waiting to pick up their right forks.</para>
				210
<programlisting><![CDATA[
Thread #6: lock order "0x6010C0 before 0x601160" violated
   at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388)
   by 0x4007C0: dine (tc14_laog_dinphils.c:19)
   by 0x4C25DF7: mythread_wrapper (hg_intercepts.c:178)
   by 0x4E2F09D: start_thread (in /lib64/libpthread-2.5.so)
   by 0x51054CC: clone (in /lib64/libc-2.5.so)
]]></programlisting>
				219
				220	</sect1>
				221
				222
				223
				224
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	225	<sect1 id="hg-manual.data-races" xreflabel="Data Races">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	226	<title>Detected errors: Data Races</title>
				227
<para>A data race happens, or could happen, when two threads
access a shared memory location without using suitable locks to
ensure single-threaded access. Such missing locking can cause
obscure timing-dependent bugs. Ensuring programs are race-free is
one of the central difficulties of threaded programming.</para>

<para>Reliably detecting races is a difficult problem, and most
of Helgrind's internals are devoted to dealing with it.
As a consequence this section is somewhat long and involved.
We begin with a simple example.</para>
				238
				239
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	240	<sect2 id="hg-manual.data-races.example" xreflabel="Simple Race">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	241	<title>A Simple Data Race</title>
				242
<para>About the simplest possible example of a race is as follows. In
this program, it is impossible to know what the value
of <computeroutput>var</computeroutput> is at the end of the program.
Is it 2? Or 1?</para>
				247
<programlisting><![CDATA[
#include <pthread.h>

int var = 0;

void* child_fn ( void* arg ) {
   var++; /* Unprotected relative to parent */ /* this is line 6 */
   return NULL;
}

int main ( void ) {
   pthread_t child;
   pthread_create(&child, NULL, child_fn, NULL);
   var++; /* Unprotected relative to child */ /* this is line 13 */
   pthread_join(child, NULL);
   return 0;
}
]]></programlisting>
				266
				267	<para>The problem is there is nothing to
				268	stop <computeroutput>var</computeroutput> being updated simultaneously
				269	by both threads. A correct program would
				270	protect <computeroutput>var</computeroutput> with a lock of type
				271	<computeroutput>pthread_mutex_t</computeroutput>, which is acquired
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	272	before each access and released afterwards. Helgrind's output for
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	273	this program is:</para>
				274
<programlisting><![CDATA[
Thread #1 is the program's root thread

Thread #2 was created
   at 0x510548E: clone (in /lib64/libc-2.5.so)
   by 0x4E2F305: do_clone (in /lib64/libpthread-2.5.so)
   by 0x4E2F7C5: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.5.so)
   by 0x4C23870: pthread_create@* (hg_intercepts.c:198)
   by 0x4005F1: main (simple_race.c:12)

Possible data race during write of size 4 at 0x601034
   at 0x4005F2: main (simple_race.c:13)
  Old state: shared-readonly by threads #1, #2
  New state: shared-modified by threads #1, #2
  Reason:    this thread, #1, holds no consistent locks
  Location 0x601034 has never been protected by any lock
]]></programlisting>
				292
				293	<para>This is quite a lot of detail for an apparently simple error.
				294	The last clause is the main error message. It says there is a race as
				295	a result of a write of size 4 (bytes), at 0x601034, which is
				296	presumably the address of <computeroutput>var</computeroutput>,
				297	happening in function <computeroutput>main</computeroutput> at line 13
				298	in the program.</para>
				299
				300	<para>Note that it is purely by chance that the race is
				301	reported for the parent thread's access. It could equally have been
				302	reported instead for the child's access, at line 6. The error will
				303	only be reported for one of the locations, since neither the parent
				304	nor child is, by itself, incorrect. It is only when both access
				305	<computeroutput>var</computeroutput> without a lock that an error
				306	exists.</para>
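
<para>The locking fix mentioned earlier, in which
<computeroutput>var</computeroutput> is protected by a
<computeroutput>pthread_mutex_t</computeroutput>, can be sketched as
follows. This is an illustrative variant of the example, not a file
from the Valgrind tree; the names
<computeroutput>var_mx</computeroutput> and
<computeroutput>run_protected_increments</computeroutput> are
hypothetical, and the work is placed in a function rather than
in <computeroutput>main</computeroutput>.</para>

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static int             var    = 0;
static pthread_mutex_t var_mx = PTHREAD_MUTEX_INITIALIZER;

static void* child_fn ( void* arg )
{
   (void)arg;
   pthread_mutex_lock(&var_mx);
   var++;                          /* protected by var_mx */
   pthread_mutex_unlock(&var_mx);
   return NULL;
}

/* Performs one increment in the parent and one in a child thread,
   both under var_mx, and returns the final value of var. */
int run_protected_increments ( void )
{
   pthread_t child;
   int       r;

   r = pthread_create(&child, NULL, child_fn, NULL);
   assert(r == 0);

   pthread_mutex_lock(&var_mx);
   var++;                          /* protected by var_mx */
   pthread_mutex_unlock(&var_mx);

   r = pthread_join(child, NULL);
   assert(r == 0);

   return var;   /* read after the join, when only the parent remains */
}
```

<para>Because every access to <computeroutput>var</computeroutput>
made while both threads exist happens with
<computeroutput>var_mx</computeroutput> held, the final value is
always 2.</para>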
				307
				308	<para>The error message shows some other interesting details. The
				309	sections below explain them. Here we merely note their presence:</para>
				310
				311	<itemizedlist>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	312	<listitem><para>Helgrind maintains some kind of state machine for the
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	313	memory location in question, hence the "<computeroutput>Old
				314	state:</computeroutput>" and "<computeroutput>New
				315	state:</computeroutput>" lines.</para>
				316	</listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	317	<listitem><para>Helgrind keeps track of which threads have accessed
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	318	the location: "<computeroutput>threads #1, #2</computeroutput>".
				319	Before printing the main error message, it prints the creation
				320	points of these two threads, so you can see which threads it is
				321	referring to.</para>
				322	</listitem>
<listitem><para>Helgrind tries to provide an explanation of why the
race exists: "<computeroutput>Location 0x601034 has never been
protected by any lock</computeroutput>".</para>
</listitem>
				327	</itemizedlist>
				328
				329	<para>Understanding the memory state machine is central to
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	330	understanding Helgrind's race-detection algorithm. The next three
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	331	subsections explain this.</para>
				332
				333	</sect2>
				334
				335
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	336	<sect2 id="hg-manual.data-races.memstates" xreflabel="Memory States">
				337	<title>Helgrind's Memory State Machine</title>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	338
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	339	<para>Helgrind tracks the state of every byte of memory used by your
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	340	program. There are a number of states, but only three are
				341	interesting:</para>
				342
				343	<itemizedlist>
				344	<listitem><para>Exclusive: memory in this state is regarded as owned
				345	exclusively by one particular thread. That thread may read and
				346	write it without a lock. Even in highly threaded programs, the
				347	majority of locations never leave the Exclusive state, since most
				348	data is thread-private.</para>
				349	</listitem>
				350	<listitem><para>Shared-Readonly: memory in this state is regarded as
				351	shared by multiple threads. In this state, any thread may read the
				352	memory without a lock, reflecting the fact that readonly data may
				353	safely be shared between threads without locking.</para>
				354	</listitem>
				355	<listitem><para>Shared-Modified: memory in this state is regarded as
				356	shared by multiple threads, at least one of which has written to it.
				357	All participating threads must hold at least one lock in common when
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	358	accessing the memory. If no such lock exists, Helgrind reports a
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	359	race error.</para>
				360	</listitem>
				361	</itemizedlist>
				362
<para>Let's review the simple example above with this in mind. When
the program starts, <computeroutput>var</computeroutput> is not in any
of these states. Either the parent or child thread gets to its
<computeroutput>var++</computeroutput> first, and
thereby gets Exclusive ownership of the location.</para>
				368
				369	<para>The later-running thread now arrives at
				370	its <computeroutput>var++</computeroutput> statement. It first reads
				371	the existing value from memory.
				372	Because <computeroutput>var</computeroutput> is currently marked as
				373	owned exclusively by the other thread, its state is changed to
				374	shared-readonly by both threads.</para>
				375
				376	<para>This same thread adds one to the value it has and stores it back
				377	in <computeroutput>var</computeroutput>. This causes another state
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	378	change, this time to the shared-modified state. Because Helgrind has
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	379	also been tracking which threads hold which locks, it can see that
				380	<computeroutput>var</computeroutput> is in shared-modified state but
				381	no lock has been used to consistently protect it. Hence a race is
				382	reported exactly at the transition from shared-readonly to
				383	shared-modified.</para>
				384
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	385	<para>The essence of the algorithm is this. Helgrind keeps track of
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	386	each memory location that has been accessed by more than one thread.
				387	For each such location it incrementally infers the set of locks which
				388	have consistently been used to protect that location. If the
				389	location's lockset becomes empty, and at some point one of the threads
				390	attempts to write to it, a race is then reported.</para>
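
<para>A much-simplified sketch of this per-location bookkeeping is
shown below. It models only the three states named above plus an
initial untouched state, represents locksets as bitmasks, and omits
every refinement described later in this section. It is not
Helgrind's implementation, and all names
(<computeroutput>Loc</computeroutput>,
<computeroutput>loc_access</computeroutput> and so on) are
hypothetical.</para>

```c
#include <assert.h>

typedef enum { ST_NEW, ST_EXCLUSIVE, ST_SHARED_RO, ST_SHARED_MOD } State;

typedef struct {
   State    state;
   int      owner;     /* meaningful in ST_EXCLUSIVE only */
   unsigned lockset;   /* bit i set: lock i was held at every shared access */
} Loc;

void loc_init ( Loc* l )
{
   l->state   = ST_NEW;
   l->owner   = -1;
   l->lockset = ~0u;   /* all locks are candidates until proven otherwise */
}

/* Record an access by thread 'tid' that currently holds the locks in
   the bitmask 'held'.  Returns 1 if a race is reported, else 0. */
int loc_access ( Loc* l, int tid, int is_write, unsigned held )
{
   switch (l->state) {
      case ST_NEW:
         l->state = ST_EXCLUSIVE;
         l->owner = tid;
         return 0;
      case ST_EXCLUSIVE:
         if (tid == l->owner)
            return 0;              /* owner needs no lock */
         l->state = is_write ? ST_SHARED_MOD : ST_SHARED_RO;
         break;
      case ST_SHARED_RO:
         if (is_write)
            l->state = ST_SHARED_MOD;
         break;
      case ST_SHARED_MOD:
         break;
   }
   /* Shared access: keep only the locks that are held this time too. */
   l->lockset &= held;
   return l->state == ST_SHARED_MOD && l->lockset == 0;
}
```

<para>Feeding this model the accesses of the simple example (a read
and a write by one thread, then a read and a write by the other, with
no locks held) takes the location through Exclusive, then
shared-readonly, and reports a race on the final write, matching the
walk-through above.</para>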
				391
				392	<para>This technique is known as "lockset inference" and was
				393	introduced in: "Eraser: A Dynamic Data Race Detector for Multithreaded
				394	Programs" (Stefan Savage, Michael Burrows, Greg Nelson, Patrick
				395	Sobalvarro and Thomas Anderson, ACM Transactions on Computer Systems,
				396	15(4):391-411, November 1997).</para>
				397
				398	<para>Lockset inference has since been widely implemented, studied and
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	399	extended. Helgrind incorporates several refinements aimed at avoiding
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	400	the high false error rate that naive versions of the algorithm suffer
				401	from. A
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	402	<link linkend="hg-manual.data-races.summary">summary of the complete
				403	algorithm used by Helgrind</link> is presented below. First, however,
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	404	it is important to understand details of transitions pertaining to the
				405	Exclusive-ownership state.</para>
				406
				407	</sect2>
				408
				409
				410
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	411	<sect2 id="hg-manual.data-races.exclusive" xreflabel="Excl Transfers">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	412	<title>Transfers of Exclusive Ownership Between Threads</title>
				413
				414	<para>As presented, the algorithm is far too strict. It reports many
				415	errors in perfectly correct, widely used parallel programming
				416	constructions, for example, using child worker threads and worker
				417	thread pools.</para>
				418
				419	<para>To avoid these false errors, we must refine the algorithm so
				420	that it keeps memory in an Exclusive ownership state in cases where it
				421	would otherwise decay into a shared-readonly or shared-modified state.
				422	Recall that Exclusive ownership is special in that it grants the
				423	owning thread the right to access memory without use of any locks. In
				424	order to support worker-thread and worker-thread-pool idioms, we will
				425	allow threads to steal exclusive ownership of memory from other
				426	threads under certain circumstances.</para>
				427
				428	<para>Here's an example. Imagine a parent thread creates child
				429	threads to do units of work. For each unit of work, the parent
				430	allocates a work buffer, fills it in, and creates the child thread,
				431	handing it a pointer to the buffer. The child reads/writes the buffer
				432	and eventually exits, and the waiting parent then extracts the results
				433	from the buffer:</para>
				434
<programlisting><![CDATA[
typedef ... Buffer;

pthread_t child;
Buffer buf;

/* ---- Parent ---- */          /* ---- Child ---- */

/* parent writes workload into buf */
pthread_create( &child, child_fn, &buf );

/* parent does not read */      void child_fn ( Buffer* buf ) {
/* or write buf */                 /* read/write buf */
                                }

pthread_join ( child );
/* parent reads results from buf */
]]></programlisting>
				453
				454	<para>Although <computeroutput>buf</computeroutput> is accessed by
				455	both threads, neither uses locks, yet the program is race-free. The
				456	essential observation is that the child's creation and exit create
				457	synchronisation events between it and the parent. These force the
				458	child's accesses to <computeroutput>buf</computeroutput> to happen
				459	after the parent initialises <computeroutput>buf</computeroutput>, and
				460	before the parent reads the results
				461	from <computeroutput>buf</computeroutput>.</para>
				462
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	463	<para>To model this, Helgrind allows the child to steal, from the
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	464	parent, exclusive ownership of any memory exclusively owned by the
				465	parent before the pthread_create call. Similarly, once the parent's
				466	pthread_join call returns, it can steal back ownership of memory
				467	exclusively owned by the child. In this way ownership
				468	of <computeroutput>buf</computeroutput> is transferred from parent to
				469	child and back, so the basic algorithm does not report any races
				470	despite the absence of any locking.</para>
				471
				472	<para>Note that the child may only steal memory owned by the parent
				473	prior to the pthread_create call. If the child attempts to read or
				474	write memory which is also accessed by the parent in between the
				475	pthread_create and pthread_join calls, an error is still
				476	reported.</para>
				477
<para>This technique was introduced with the name "thread lifetime
segments" in "Runtime Checking of Multithreaded Applications with
Visual Threads" (Jerry J. Harrow, Jr, Proceedings of the 7th
International SPIN Workshop on Model Checking of Software, Stanford,
California, USA, August 2000, LNCS 1885, pp331--342). Helgrind
implements an extended version of it. Specifically, Helgrind allows
transfer of exclusive ownership in the following situations:</para>
				485
				486	<itemizedlist>
				487	<listitem><para>At thread creation: a child can acquire ownership of
				488	memory held exclusively by the parent prior to the child's
				489	creation.</para>
				490	</listitem>
				491	<listitem><para>At thread joining: the joiner (thread not exiting)
				492	can acquire ownership of memory held exclusively by the joinee
				493	(thread that is exiting) at the point it exited.</para>
				494	</listitem>
				495	<listitem><para>At condition variable signallings and broadcasts. A
				496	thread Tw which completes a pthread_cond_wait call as a result of
				497	a signal or broadcast on the same condition variable by some other
				498	thread Ts, may acquire ownership of memory held exclusively by
				499	Ts prior to the pthread_cond_signal/broadcast
				500	call.</para>
				501	</listitem>
<listitem><para>At semaphore post (sem_post) calls. A thread Tw
which completes a sem_wait call as a result of a sem_post call
on the same semaphore by some other thread Tp, may acquire
ownership of memory held exclusively by Tp prior to the sem_post
call.</para>
</listitem>
				508	</itemizedlist>
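
<para>A runnable rendering of the thread creation and thread joining
cases might look like the following. It is an illustrative sketch:
the <computeroutput>Buffer</computeroutput> layout, the function names
and the doubling workload are assumptions made for the example, and
<computeroutput>pthread_create</computeroutput> and
<computeroutput>pthread_join</computeroutput> are called with their
full POSIX argument lists.</para>

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical work buffer: one input field and one result field. */
typedef struct { int input; int result; } Buffer;

static void* child_fn ( void* arg )
{
   Buffer* buf = arg;
   buf->result = 2 * buf->input;   /* child reads and writes buf, no lock */
   return NULL;
}

/* Hands 'workload' to a child thread via a buffer and returns the
   result the child stored there. */
int run_handoff ( int workload )
{
   pthread_t child;
   Buffer    buf;
   int       r;

   buf.input  = workload;          /* parent writes before creation */
   buf.result = 0;

   r = pthread_create(&child, NULL, child_fn, &buf);
   assert(r == 0);

   /* Between create and join the parent does not touch buf. */

   r = pthread_join(child, NULL);
   assert(r == 0);

   return buf.result;              /* parent reads after the join */
}
```

<para>No lock is used, yet the accesses cannot overlap: the creation
orders the parent's writes before the child's accesses, and the join
orders the child's accesses before the parent's final read.</para>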
				509
				510	</sect2>
				511
				512
				513
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	514	<sect2 id="hg-manual.data-races.re-excl" xreflabel="Re-Excl Transfers">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	515	<title>Restoration of Exclusive Ownership</title>
				516
				517	<para>Another common idiom is to partition the lifetime of the program
				518	as a whole into several distinct phases. In some of those phases, a
				519	memory location may be accessed by multiple threads and so require
				520	locking. In other phases only one thread exists and so can access the
				521	memory without locking. For example:</para>
				522
<programlisting><![CDATA[
int var = 0;                                      /* shared variable */
pthread_mutex_t mx = PTHREAD_MUTEX_INITIALIZER;   /* guard for var */
pthread_t child;

/* ---- Parent ---- */                          /* ---- Child ---- */

var += 1; /* no lock used */

pthread_create( &child, child_fn, NULL );

                                                void child_fn ( void* uu ) {
pthread_mutex_lock(&mx);                           pthread_mutex_lock(&mx);
var += 2;                                          var += 3;
pthread_mutex_unlock(&mx);                         pthread_mutex_unlock(&mx);
                                                }

pthread_join ( child );

var += 4; /* no lock used */
]]></programlisting>
				544
				545	<para>This program is correct, but using only the mechanisms described
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	546	so far, Helgrind would report an error at
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	547	<computeroutput>var += 4</computeroutput>. This is because, by that
				548	point, <computeroutput>var</computeroutput> is marked as being in the
				549	state "shared-modified and protected by the
				550	lock <computeroutput>mx</computeroutput>", but is being accessed
				551	without locking. Really, what we want is
				552	for <computeroutput>var</computeroutput> to return to the parent
				553	thread's exclusive ownership after the child thread has exited.</para>
				554
<para>To make this possible, for every memory location Helgrind also keeps
track of all the threads that have accessed that location
-- its threadset. When a thread Tquitter joins back to Tstayer,
Helgrind examines the threadsets of all memory in shared-modified or
shared-readable state. In each such threadset, if Tquitter is
mentioned, it is removed and replaced by Tstayer. If, as a result, a
threadset becomes a singleton set containing Tstayer, then the
location's state is changed to belongs-exclusively-to-Tstayer.</para>
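
<para>The update performed at a join can be pictured with the
following toy model. It is only a sketch: the bitmask representation
and the names <computeroutput>location_t</computeroutput>
and <computeroutput>on_join</computeroutput> are invented for
illustration and do not reflect Helgrind's real data
structures.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>

/* Toy model: bit i set means thread i has accessed the location. */
typedef uint32_t threadset_t;

typedef enum { EXCLUSIVE, SHARED } ownership_t;

typedef struct {
   ownership_t state;
   threadset_t threads;
} location_t;

/* Count the threads in a set. */
static int threadset_size ( threadset_t ts ) {
   int n = 0;
   while (ts) { ts &= ts - 1; n++; }
   return n;
}

/* Apply "quitter joins back to stayer" to one shared location. */
void on_join ( location_t* loc, int quitter, int stayer ) {
   assert(quitter >= 0 && quitter < 32);
   assert(stayer  >= 0 && stayer  < 32);
   threadset_t q = (threadset_t)1 << quitter;
   threadset_t s = (threadset_t)1 << stayer;
   if (loc->state != SHARED || !(loc->threads & q))
      return;                          /* quitter not mentioned */
   loc->threads = (loc->threads & ~q) | s;
   if (threadset_size(loc->threads) == 1)
      loc->state = EXCLUSIVE;          /* singleton: stayer owns it */
}
]]></programlisting>

<para>In this model, a location shared by threads 1 and 2 becomes
exclusively owned by thread 1 once thread 2 joins back to it, whereas
a location also touched by a third thread stays shared.</para>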
				563
				564	<para>In our example, the result is exactly as we desire:
				565	<computeroutput>var</computeroutput> is reacquired exclusively by the
				566	parent after the child exits.</para>
				567
				568	<para>More generally, when a group of threads merges back to a single
				569	thread via a cascade of pthread_join calls, any memory shared by the
				570	group (or a subset of it) ends up being owned exclusively by the sole
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	571	surviving thread. This significantly enhances Helgrind's flexibility,
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	572	since it means that each memory location may make arbitrarily many
				573	transitions between exclusive and shared ownership. Furthermore, a
				574	different lock may protect the location during each period of shared
				575	ownership.</para>
				576
				577	</sect2>
				578
				579
				580
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	581	<sect2 id="hg-manual.data-races.summary" xreflabel="Race Det Summary">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	582	<title>A Summary of the Race Detection Algorithm</title>
				583
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	584	<para>Helgrind looks for memory locations which are accessed by more
				585	than one thread. For each such location, Helgrind records which of
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	586	the program's locks were held by the accessing thread at the time of
				587	each access. The hope is to discover that there is indeed at least
				588	one lock which is consistently used by all threads to protect that
				589	location. If no such lock can be found, then there is apparently no
				590	consistent locking strategy being applied for that location, and so a
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	591	possible data race might result. Helgrind accordingly reports an
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	592	error.</para>
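
<para>The basic idea amounts to intersecting, at each access, the set
of locks held so far with the set held now. The following is a
minimal sketch of that idea only, assuming the location is already
known to be shared and that locks are numbered 0 to 31; the
names <computeroutput>shadow_t</computeroutput>
and <computeroutput>record_access</computeroutput> are hypothetical
and not part of Helgrind.</para>

<programlisting><![CDATA[
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t lockset_t;   /* bit i set: lock i is in the set */

typedef struct {
   bool      accessed;     /* false until the first access */
   lockset_t candidates;   /* locks held at every access so far */
} shadow_t;

/* Record one access made while holding the locks in 'held'.
   Returns true if at least one lock has been held at every access
   so far, false if no such lock remains (a possible race). */
bool record_access ( shadow_t* sh, lockset_t held ) {
   if (!sh->accessed) {
      sh->accessed   = true;
      sh->candidates = held;
   } else {
      sh->candidates &= held;
   }
   return sh->candidates != 0;
}
]]></programlisting>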
				593
				594	<para>In practice this discipline is far too simplistic, and is
				595	unusable since it reports many races in some widely used and
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	596	known-correct programming disciplines. Helgrind's checking therefore
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	597	incorporates many refinements to this basic idea, and can be
				598	summarised as follows:</para>
				599
				600	<para>The following thread events are intercepted and monitored:</para>
				601
				602	<itemizedlist>
				603	<listitem><para>thread creation and exiting (pthread_create,
				604	pthread_join, pthread_exit)</para>
				605	</listitem>
				606	<listitem>
				607	<para>lock acquisition and release (pthread_mutex_lock,
				608	pthread_mutex_unlock, pthread_rwlock_rdlock,
				609	pthread_rwlock_wrlock,
				610	pthread_rwlock_unlock)</para>
				611	</listitem>
				612	<listitem>
				613	<para>inter-thread event notifications (pthread_cond_wait,
				614	pthread_cond_signal, pthread_cond_broadcast,
				615	sem_wait, sem_post)</para>
				616	</listitem>
				617	</itemizedlist>
				618
				619	<para>Memory allocation and deallocation events are intercepted and
				620	monitored:</para>
				621
				622	<itemizedlist>
				623	<listitem>
				624	<para>malloc/new/free/delete and variants</para>
				625	</listitem>
				626	<listitem>
				627	<para>stack allocation and deallocation</para>
				628	</listitem>
				629	</itemizedlist>
				630
				631	<para>All memory accesses are intercepted and monitored.</para>
				632
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	633	<para>By observing the above events, Helgrind can infer certain
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	634	aspects of the program's locking discipline. Programs which adhere to
				635	the following rules are considered to be acceptable:
				636	</para>
				637
				638	<itemizedlist>
				639	<listitem>
				640	<para>A thread may allocate memory, and write initial values into
				641	it, without locking. That thread is regarded as owning the memory
				642	exclusively.</para>
				643	</listitem>
				644	<listitem>
				645	<para>A thread may read and write memory which it owns exclusively,
				646	without locking.</para>
				647	</listitem>
				648	<listitem>
				649	<para>Memory which is owned exclusively by one thread may be read by
				650	that thread and others without locking. However, in this situation
				651	no thread may do unlocked writes to the memory (except for the owner
				652	thread's initializing write).</para>
				653	</listitem>
				654	<listitem>
				655	<para>Memory which is shared between multiple threads, one or more
				656	of which writes to it, must be protected by a lock which is
				657	correctly acquired and released by all threads accessing the
				658	memory.</para>
				659	</listitem>
				660	</itemizedlist>
				661
				662	<para>Any violation of this discipline will cause an error to be reported.
				663	However, two exemptions apply:</para>
				664
				665	<itemizedlist>
				666	<listitem>
				667	<para>A thread Y can acquire exclusive ownership of memory
				668	previously owned exclusively by a different thread X providing
				669	X's last access and Y's first access are separated by one of the
				670	following synchronization events:</para>
				671	<itemizedlist>
				672	<listitem><para>X creates thread Y</para></listitem>
				673	<listitem><para>X joins back to Y</para></listitem>
				674	<listitem><para>X uses a condition-variable to signal at Y, and Y is
				675	waiting for that event</para></listitem>
				676	<listitem><para>Y completes a semaphore wait as a result of X signalling
				677	on that same semaphore</para></listitem>
				678	</itemizedlist>
				679	<para>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	680	This refinement allows Helgrind to correctly track the ownership
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	681	state of inter-thread buffers used in the worker-thread and
				682	worker-thread-pool concurrent programming idioms (styles).</para>
				683	</listitem>
				684	<listitem>
				685	<para>Similarly, if thread Y joins back to thread X, memory
				686	exclusively owned by Y becomes exclusively owned by X instead.
				687	Also, memory that has been shared only by X and Y becomes
				688	exclusively owned by X. More generally, memory that has been shared
				689	by X, Y and some arbitrary other set S of threads is re-marked as
				690	shared by X and S. Hence, under the right circumstances, memory
				691	shared amongst multiple threads, all of which join into just one,
				692	can revert to the exclusive ownership state.</para>
				693	<para>
				694	In effect, each memory location may make arbitrarily many
				695	transitions between exclusive and shared ownership. Furthermore, a
				696	different lock may protect the location during each period of shared
				697	ownership. This significantly enhances the flexibility of the
				698	algorithm.</para>
				699	</listitem>
				700	</itemizedlist>
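
<para>As an illustration of the first exemption, the sketch below
hands a buffer from a parent to a worker thread using a semaphore
rather than a lock. The function and variable names are invented for
this example. The parent's last access
to <computeroutput>buf</computeroutput> and the worker's first access
are separated by a semaphore post and wait, and the result is handed
back by the join.</para>

<programlisting><![CDATA[
#include <pthread.h>
#include <semaphore.h>

static int   buf;     /* handed from parent to worker, never locked */
static sem_t ready;   /* posted once buf has been filled in */

static void* worker ( void* result ) {
   sem_wait(&ready);          /* completes only after the parent posts */
   *(int*)result = buf * 2;   /* worker's first access to buf */
   return NULL;
}

/* Pass 'value' to a worker thread and return what it computes. */
int hand_off ( int value ) {
   pthread_t t;
   int       result = 0;
   sem_init(&ready, 0, 0);    /* initial value zero */
   pthread_create(&t, NULL, worker, &result);
   buf = value;               /* parent's last access to buf */
   sem_post(&ready);
   pthread_join(t, NULL);     /* result returns to the parent */
   sem_destroy(&ready);
   return result;
}
]]></programlisting>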
				701
				702	<para>The ownership state, accessing thread-set and related lock-set
				703	for each memory location are tracked at 8-bit granularity. This means
				704	the algorithm is precise even for 16- and 8-bit memory
				705	accesses.</para>
				706
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	707	<para>Helgrind correctly handles reader-writer locks in this
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	708	framework. Locations shared between multiple threads can be protected
				709	during reads by locks held in either read-mode or write-mode, but can
				710	only be protected during writes by locks held in write-mode. Normal
				711	POSIX mutexes are treated as if they are reader-writer locks which are
				712	only ever held in write-mode.</para>
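
<para>A small sketch of a locking discipline that fits this rule is
shown below. The variable and function names are hypothetical. Reads
take the lock in read-mode and writes take it in write-mode.</para>

<programlisting><![CDATA[
#include <pthread.h>

static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
static int shared_value = 0;   /* guarded by rw */

int read_value ( void ) {
   int v;
   pthread_rwlock_rdlock(&rw);   /* read-mode suffices for a read */
   v = shared_value;
   pthread_rwlock_unlock(&rw);
   return v;
}

void write_value ( int v ) {
   pthread_rwlock_wrlock(&rw);   /* a write needs write-mode */
   shared_value = v;
   pthread_rwlock_unlock(&rw);
}
]]></programlisting>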
				713
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	714	<para>Helgrind correctly handles POSIX mutexes for which recursive
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	715	locking is allowed.</para>
				716
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	717	<para>Helgrind partially correctly handles x86 and amd64 memory access
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	718	instructions preceded by a LOCK prefix. Writes are correctly handled,
				719	by pretending that the LOCK prefix implies acquisition and release of
				720	a magic "bus hardware lock" mutex before and after the instruction.
				721	This unfortunately requires subsequent reads from such locations to
				722	also use a LOCK prefix, which is not required by the real hardware.
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	723	Helgrind does not offer any equivalent handling for atomic sequences
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	724	on PowerPC/POWER platforms created by the use of lwarx/stwcx
				725	instructions.</para>
				726
				727	</sect2>
				728
				729
				730
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	731	<sect2 id="hg-manual.data-races.errmsgs" xreflabel="Race Error Messages">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	732	<title>Interpreting Race Error Messages</title>
				733
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	734	<para>Helgrind's race detection algorithm collects a lot of
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	735	information, and tries to present it in a helpful way when a race is
				736	detected. Here's an example:</para>
				737
<programlisting><![CDATA[
Thread #2 was created
   at 0x510548E: clone (in /lib64/libc-2.5.so)
   by 0x4E2F305: do_clone (in /lib64/libpthread-2.5.so)
   by 0x4E2F7C5: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.5.so)
   by 0x4C23870: pthread_create@* (hg_intercepts.c:198)
   by 0x400CEF: main (tc17_sembar.c:195)

// And the same for threads #3, #4 and #5 -- omitted for conciseness

Possible data race during read of size 4 at 0x602174
   at 0x400BE5: gomp_barrier_wait (tc17_sembar.c:122)
   by 0x400C44: child (tc17_sembar.c:161)
   by 0x4C25DF7: mythread_wrapper (hg_intercepts.c:178)
   by 0x4E2F09D: start_thread (in /lib64/libpthread-2.5.so)
   by 0x51054CC: clone (in /lib64/libc-2.5.so)
  Old state: shared-modified by threads #2, #3, #4, #5
  New state: shared-modified by threads #2, #3, #4, #5
  Reason: this thread, #2, holds no consistent locks
  Last consistently used lock for 0x602174 was first observed
   at 0x4C25D01: pthread_mutex_init (hg_intercepts.c:326)
   by 0x4009E4: gomp_barrier_init (tc17_sembar.c:46)
   by 0x400CBC: main (tc17_sembar.c:192)
]]></programlisting>
				762
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	763	<para>Helgrind first announces the creation points of any threads
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	764	referenced in the error message. This is so it can speak concisely
				765	about threads and sets of threads without repeatedly printing their
				766	creation point call stacks. Each thread is only ever announced once,
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	767	the first time it appears in any Helgrind error message.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	768
				769	<para>The main error message begins at the text
				770	"<computeroutput>Possible data race during read</computeroutput>".
				771	At the start is information you would expect to see -- address and
				772	size of the racing access, whether a read or a write, and the call
				773	stack at the point it was detected.</para>
				774
				775	<para>More interesting is the state transition caused by this access.
				776	This memory is already in the shared-modified state, and up to now has
				777	been consistently protected by at least one lock. However, the thread
				778	making the access in question (thread #2, here) does not hold any
				779	locks in common with those held during all previous accesses to the
				780	location -- "no consistent locks", in other words.</para>
				781
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	782	<para>Finally, Helgrind shows the lock which has protected this
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	783	location in all previous accesses. (If there is more than one, only
				784	one is shown). This can be a useful hint, because it typically shows
				785	the lock that the programmers intended to use to protect the location,
				786	but in this case forgot.</para>
				787
<para>Here are some more examples of race reports. This is not an
exhaustive list of combinations, but should give you some insight into
how to interpret the output.</para>
				791
<programlisting><![CDATA[
Possible data race during write ...
  Old state: shared-readonly by threads #1, #2, #3
  New state: shared-modified by threads #1, #2, #3
  Reason: this thread, #3, holds no consistent locks
  Location ... has never been protected by any lock
]]></programlisting>
				799
				800	<para>The location is shared by 3 threads, all of which have been
				801	reading it without locking ("has never been protected by any lock").
				802	Now one of them is writing it. Regardless of whether the writer has a
				803	lock or not, this is still an error, because the write races against
				804	the previously observed reads.</para>
				805
<programlisting><![CDATA[
Possible data race during read ...
  Old state: shared-modified by threads #1, #2, #3
  New state: shared-modified by threads #1, #2, #3
  Reason: this thread, #3, holds no consistent locks
  Last consistently used lock for ... was first observed ...
]]></programlisting>
				813
				814	<para>The location is shared by 3 threads, all of which have been
				815	reading and writing it while (as required) holding at least one lock
				816	in common. Now it is being read without that lock being held. In the
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	817	"Last consistently used lock" part, Helgrind offers its best guess as
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	818	to the identity of the lock that should have been used.</para>
				819
<programlisting><![CDATA[
Possible data race during write ...
  Old state: owned exclusively by thread #4
  New state: shared-modified by threads #4, #5
  Reason: this thread, #5, holds no locks at all
]]></programlisting>
				826
				827	<para>A location that has so far been accessed exclusively by thread
				828	#4 has now been written by thread #5, without use of any lock. This
				829	can be a sign that the programmer did not consider the possibility of
				830	the location being shared between threads, or, alternatively, forgot
				831	to use the appropriate lock.</para>
				832
				833	<para>Note that thread #4 exclusively owns the location, and so has
				834	the right to access it without holding a lock. However, this message
				835	does not say that thread #4 is not using a lock for this location.
				836	Indeed, it could be using a lock for the location because it intends
				837	to make it available to other threads, one of which is thread #5 --
				838	and thread #5 has forgotten to use the lock.</para>
				839
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	840	<para>Also, this message implies that Helgrind did not see any
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	841	synchronisation event between threads #4 and #5 that would have
				842	allowed #5 to acquire exclusive ownership from #4. See
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	843	<link linkend="hg-manual.data-races.exclusive">above</link>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	844	for a discussion of transfers of exclusive ownership states between
				845	threads.</para>
				846
				847	</sect2>
				848
				849
				850	</sect1>
				851
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	852	<sect1 id="hg-manual.effective-use" xreflabel="Helgrind Effective Use">
				853	<title>Hints and Tips for Effective Use of Helgrind</title>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	854
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	855	<para>Helgrind can be very helpful in finding and resolving
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	856	threading-related problems. Like all sophisticated tools, it is most
				857	effective when you understand how to play to its strengths.</para>
				858
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	859	<para>Helgrind will be less effective when you merely throw an
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	860	existing threaded program at it and try to make sense of any reported
				861	errors. It will be more effective if you design threaded programs
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	862	from the start in a way that helps Helgrind verify correctness. The
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	863	same is true for finding memory errors with Memcheck, but applies more
				864	here, because thread checking is a harder problem. Consequently it is
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	865	much easier to write a correct program for which Helgrind falsely
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	866	reports (threading) errors than it is to write a correct program for
				867	which Memcheck falsely reports (memory) errors.</para>
				868
				869	<para>With that in mind, here are some tips, listed most important first,
				870	for getting reliable results and avoiding false errors. The first two
				871	are critical. Any violations of them will swamp you with huge numbers
				872	of false data-race errors.</para>
				873
				874
				875	<orderedlist>
				876
				877	<listitem>
				878	<para>Make sure your application, and all the libraries it uses,
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	879	use the POSIX threading primitives. Helgrind needs to be able to
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	880	see all events pertaining to thread creation, exit, locking and
other synchronisation events. To do so it intercepts many POSIX
				882	pthread_ functions.</para>
				883
				884	<para>Do not roll your own threading primitives (mutexes, etc)
				885	from combinations of the Linux futex syscall, counters and wotnot.
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	886	These throw Helgrind's internal what's-going-on models way off
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	887	course and will give bogus results.</para>
				888
				889	<para>Also, do not reimplement existing POSIX abstractions using
				890	other POSIX abstractions. For example, don't build your own
				891	semaphore routines or reader-writer locks from POSIX mutexes and
				892	condition variables. Instead use POSIX reader-writer locks and
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	893	semaphores directly, since Helgrind supports them directly.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	894
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	895	<para>Helgrind directly supports the following POSIX threading
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	896	abstractions: mutexes, reader-writer locks, condition variables
				897	(but see below), and semaphores. Currently spinlocks and barriers
				898	are not supported, although they could be in future. A prototype
				899	"safe" implementation of barriers, based on semaphores, is
				900	available: please contact the Valgrind authors for details.</para>
				901
				902	<para>At the time of writing, the following popular Linux packages
				903	are known to implement their own threading primitives:</para>
				904
				905	<itemizedlist>
				906	<listitem><para>Qt version 4.X. Qt 3.X is fine, but not 4.X.
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	907	Helgrind contains partial direct support for Qt 4.X threading,
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	908	but this is not yet in a usable state. Assistance from folks
				909	knowledgeable in Qt 4 threading internals would be
				910	appreciated.</para></listitem>
				911
				912	<listitem><para>Runtime support library for GNU OpenMP (part of
				913	GCC), at least GCC versions 4.2 and 4.3. With some minor effort
				914	of modifying the GNU OpenMP runtime support sources, it is
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	915	possible to use Helgrind on GNU OpenMP compiled codes. Please
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	916	contact the Valgrind authors for details.</para></listitem>
				917	</itemizedlist>
				918	</listitem>
				919
				920	<listitem>
<para>Avoid memory recycling. If you can't avoid it, you must
tell Helgrind what is going on via the VALGRIND_HG_CLEAN_MEMORY
client request
(in <computeroutput>helgrind.h</computeroutput>).</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	925
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	926	<para>Helgrind is aware of standard memory allocation and
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	927	deallocation that occurs via malloc/free/new/delete and from entry
				928	and exit of stack frames. In particular, when memory is
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	929	deallocated via free, delete, or function exit, Helgrind considers
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	930	that memory clean, so when it is eventually reallocated, its
				931	history is irrelevant.</para>
				932
				933	<para>However, it is common practice to implement memory recycling
schemes. In these, memory to be freed is not handed to
free/delete, but instead put into a pool of free buffers to be
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	936	handed out again as required. The problem is that Helgrind has no
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	937	way to know that such memory is logically no longer in use, and
				938	its history is irrelevant. Hence you must make that explicit,
				939	using the VALGRIND_HG_CLEAN_MEMORY client request to specify the
				940	relevant address ranges. It's easiest to put these requests into
				941	the pool manager code, and use them either when memory is returned
				942	to the pool, or is allocated from it.</para>
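
<para>A minimal sketch of such a pool manager follows. The pool
layout and the names <computeroutput>pool_get</computeroutput>
and <computeroutput>pool_put</computeroutput> are made up for this
example, and the header
path <computeroutput>valgrind/helgrind.h</computeroutput> is an
assumption about where the header is installed. If the header cannot
be found, the request is compiled as a no-op. Here the request is
issued when a buffer is returned to the pool.</para>

<programlisting><![CDATA[
#include <stddef.h>

#if defined(__has_include)
#  if __has_include(<valgrind/helgrind.h>)
#    include <valgrind/helgrind.h>
#  endif
#endif
#ifndef VALGRIND_HG_CLEAN_MEMORY
   /* Header not available: make the request a no-op. */
#  define VALGRIND_HG_CLEAN_MEMORY(start, len) ((void)(start), (void)(len))
#endif

#define BUF_SIZE 256
#define N_BUFS   4

static char  bufs[N_BUFS][BUF_SIZE];
static char* free_list[N_BUFS];
static int   n_free = -1;   /* -1: pool not yet set up */

/* Callers are assumed to serialise calls, for example with a mutex. */
char* pool_get ( void ) {
   if (n_free < 0) {
      for (n_free = 0; n_free < N_BUFS; n_free++)
         free_list[n_free] = bufs[n_free];
   }
   return n_free > 0 ? free_list[--n_free] : NULL;
}

void pool_put ( char* buf ) {
   /* Tell Helgrind the buffer's access history no longer matters. */
   VALGRIND_HG_CLEAN_MEMORY(buf, BUF_SIZE);
   free_list[n_free++] = buf;
}
]]></programlisting>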
				943	</listitem>
				944
				945	<listitem>
				946	<para>Avoid POSIX condition variables. If you can, use POSIX
				947	semaphores (sem_t, sem_post, sem_wait) to do inter-thread event
				948	signalling. Semaphores with an initial value of zero are
				949	particularly useful for this.</para>
				950
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	951	<para>Helgrind only partially correctly handles POSIX condition
				952	variables. This is because Helgrind can see inter-thread
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	953	dependencies between a pthread_cond_wait call and a
				954	pthread_cond_signal/broadcast call only if the waiting thread
				955	actually gets to the rendezvous first (so that it actually calls
				956	pthread_cond_wait). It can't see dependencies between the threads
				957	if the signaller arrives first. In the latter case, POSIX
				958	guidelines imply that the associated boolean condition still
				959	provides an inter-thread synchronisation event, but one which is
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	960	invisible to Helgrind.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	961
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	962	<para>The result of Helgrind missing some inter-thread
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	963	synchronisation events is to cause it to report false positives.
				964	That's because missing such events reduces the extent to which it
				965	can transfer exclusive memory ownership between threads. So
				966	memory may end up in a shared-modified state when that was not
				967	intended by the application programmers.</para>
				968
				969	<para>The root cause of this synchronisation lossage is
				970	particularly hard to understand, so an example is helpful. It was
				971	discussed at length by Arndt Muehlenfeld ("Runtime Race Detection
				972	in Multi-Threaded Programs", Dissertation, TU Graz, Austria). The
				973	canonical POSIX-recommended usage scheme for condition variables
				974	is as follows:</para>
				975
<programlisting><![CDATA[
b   is a Boolean condition, which is False most of the time
cv  is a condition variable
mx  is its associated mutex

Signaller:                             Waiter:

lock(mx)                               lock(mx)
b = True                               while (b == False)
signal(cv)                                wait(cv,mx)
unlock(mx)                             unlock(mx)
]]></programlisting>
				988
				989	<para>Assume <computeroutput>b</computeroutput> is False most of
				990	the time. If the waiter arrives at the rendezvous first, it
				991	enters its while-loop, waits for the signaller to signal, and
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	992	eventually proceeds. Helgrind sees the signal, notes the
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	993	dependency, and all is well.</para>
				994
				995	<para>If the signaller arrives
				996	first, <computeroutput>b</computeroutput> is set to true, and the
				997	signal disappears into nowhere. When the waiter later arrives, it
				998	does not enter its while-loop and simply carries on. But even in
				999	this case, the waiter code following the while-loop cannot execute
				1000	until the signaller sets <computeroutput>b</computeroutput> to
				1001	True. Hence there is still the same inter-thread dependency, but
				1002	this time it is through an arbitrary in-memory condition, and
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1003	Helgrind cannot see it.</para>
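
<para>For concreteness, the scheme can be written with the real
pthread calls roughly as follows. This is a sketch only, and the
function names <computeroutput>signaller</computeroutput>
and <computeroutput>waiter</computeroutput> are chosen to match the
scheme rather than taken from any library.</para>

<programlisting><![CDATA[
#include <pthread.h>
#include <stdbool.h>

static bool            b  = false;   /* the Boolean condition */
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static pthread_mutex_t mx = PTHREAD_MUTEX_INITIALIZER;

void signaller ( void ) {
   pthread_mutex_lock(&mx);
   b = true;
   pthread_cond_signal(&cv);   /* has no effect if nobody is waiting */
   pthread_mutex_unlock(&mx);
}

void waiter ( void ) {
   pthread_mutex_lock(&mx);
   while (!b)                  /* skipped if the signaller ran first */
      pthread_cond_wait(&cv, &mx);
   pthread_mutex_unlock(&mx);
}
]]></programlisting>

<para>If <computeroutput>signaller</computeroutput> runs to completion
before <computeroutput>waiter</computeroutput> starts, the loop body
never executes and <computeroutput>pthread_cond_wait</computeroutput>
is never called, which is exactly the case in which no wait event is
available to be observed.</para>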
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1004
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1005	<para>By comparison, Helgrind's detection of inter-thread
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1006	dependencies caused by semaphore operations is believed to be
				1007	exactly correct.</para>
				1008
				1009	<para>As far as I know, a solution to this problem that does not
				1010	require source-level annotation of condition-variable wait loops
				1011	is beyond the current state of the art.</para>
				1012	</listitem>
				1013
				1014	<listitem>
				1015	<para>Make sure you are using a supported Linux distribution. At
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1016	present, Helgrind only properly supports x86-linux and amd64-linux
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1017	with glibc-2.3 or later. The latter restriction means we only
				1018	support glibc's NPTL threading implementation. The old
				1019	LinuxThreads implementation is not supported.</para>
				1020
				1021	<para>Unsupported targets may work to varying degrees. In
particular ppc32-linux and ppc64-linux running NPTL should work,
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1023	but you will get false race errors because Helgrind does not know
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1024	how to properly handle atomic instruction sequences created using
				1025	the lwarx/stwcx instructions.</para>
				1026	</listitem>
				1027
				1028	<listitem>
				1029	<para>Round up all finished threads using pthread_join. Avoid
				1030	detaching threads: don't create threads in the detached state, and
				1031	don't call pthread_detach on existing threads.</para>
				1032
				1033	<para>Using pthread_join to round up finished threads provides a
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1034	clear synchronisation point that both Helgrind and programmers can
				1035	see. This synchronisation point allows Helgrind to adjust its
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1036	memory ownership
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1037	models <link linkend="hg-manual.data-races.exclusive">as described
				1038	extensively above</link>, which helps Helgrind produce more
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1039	accurate error reports.</para>
				1040
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1041	<para>If you don't call pthread_join on a thread, Helgrind has no
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1042	way to know when it finishes, relative to any significant
				1043	synchronisation points for other threads in the program. So it
				1044	assumes that the thread lingers indefinitely and can potentially
				1045	interfere indefinitely with the memory state of the program. It
				1046	has every right to assume that -- after all, it might really be
				1047	the case that, for scheduling reasons, the exiting thread did run
				1048	very slowly in the last stages of its life.</para>
				1049	</listitem>
				1050
				1051	<listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1052	<para>Perform thread debugging (with Helgrind) and memory
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1053	debugging (with Memcheck) together.</para>
				1054
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1055	<para>Helgrind tracks the state of memory in detail, and memory
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1056	management bugs in the application are liable to cause confusion.
				1057	In extreme cases, applications which do many invalid reads and
				1058	writes (particularly to freed memory) have been known to crash
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1059	Helgrind. So, ideally, you should make your application
				1060	Memcheck-clean before using Helgrind.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1061
				1062	<para>It may be impossible to make your application Memcheck-clean
				1063	unless you first remove threading bugs. In particular, it may be
				1064	difficult to remove all reads and writes to freed memory in
				1065	multithreaded C++ destructor sequences at program termination.
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1066	So, ideally, you should make your application Helgrind-clean
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1067	before using Memcheck.</para>
				1068
				1069	<para>Since this circularity is obviously unresolvable, at least
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1070	bear in mind that Memcheck and Helgrind are to some extent
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1071	complementary, and you may need to use them together.</para>
				1072	</listitem>
				1073
				1074	<listitem>
				1075	<para>POSIX requires that implementations of standard I/O (printf,
				1076	fprintf, fwrite, fread, etc.) be thread-safe. Unfortunately, GNU
				1077	libc implements this by using internal locking primitives that
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1078	Helgrind is unable to intercept. Consequently Helgrind generates
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1079	many false race reports when you use these functions.</para>
				1080
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1081	<para>Helgrind attempts to hide these errors using the standard
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1082	Valgrind error-suppression mechanism. So, at least for simple
				1083	test cases, you should not see any. Nevertheless, some may slip
				1084	through, so bear this in mind when reading race reports.</para>
				1085	</listitem>
				1086
				1087	<listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1088	<para>Helgrind's error checks do not work properly inside the
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1089	system threading library itself
				1090	(<computeroutput>libpthread.so</computeroutput>), and it usually
				1091	observes large numbers of (false) errors in there. Valgrind's
				1092	suppression system then filters these out, so you should not see
				1093	them.</para>
				1094
				1095	<para>If you see any race errors reported
				1096	where <computeroutput>libpthread.so</computeroutput> or
				1097	<computeroutput>ld.so</computeroutput> is the object associated
				1098	with the innermost stack frame, please file a bug report at
				1099	http://www.valgrind.org.</para>
				1100	</listitem>
				1101
				1102	</orderedlist>
				1103
				1104	</sect1>
				1105
				1106
				1107
				1108
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1109	<sect1 id="hg-manual.options" xreflabel="Helgrind Options">
				1110	<title>Helgrind Options</title>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1111
				1112	<para>The following end-user options are available:</para>
				1113
				1114	<!-- start of xi:include in the manpage -->
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1115	<variablelist id="hg.opts.list">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1116
				1117	<varlistentry id="opt.happens-before" xreflabel="--happens-before">
				1118	<term>
				1119	<option><![CDATA[--happens-before=none|threads|all
				1120	[default: all] ]]></option>
				1121	</term>
				1122	<listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1123	<para>Helgrind always regards locks as the basis for
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1124	inter-thread synchronisation. However, by default, before
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1125	reporting a race error, Helgrind will also check whether
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1126	certain other kinds of inter-thread synchronisation events
				1127	happened. It may be that if such events took place, then no
				1128	race really occurred, and so no error needs to be reported.
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1129	See <link linkend="hg-manual.data-races.exclusive">above</link>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1130	for a discussion of transfers of exclusive ownership states
				1131	between threads.
				1132	</para>
				1133	<para>With <varname>--happens-before=all</varname>, the
				1134	following events are regarded as sources of synchronisation:
				1135	thread creation/joinage, condition variable
				1136	signal/broadcast/waits, and semaphore posts/waits.
				1137	</para>
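<para>As an illustration of one such event, the sketch below hands a
value from one thread to another using only a semaphore post/wait
pair. It assumes a platform with unnamed POSIX semaphores
(<computeroutput>sem_init</computeroutput>), such as Linux; the names
<computeroutput>payload</computeroutput>,
<computeroutput>ready</computeroutput>,
<computeroutput>producer</computeroutput> and
<computeroutput>receive_payload</computeroutput> are hypothetical. No
mutex protects <computeroutput>payload</computeroutput>, so going by
the description above, the read should be accepted only when semaphore
events count as synchronisation.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

static int payload;   /* shared data, deliberately not guarded by a mutex */
static sem_t ready;   /* posted once payload has been written */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;        /* the write precedes the post */
    sem_post(&ready);
    return NULL;
}

/* Starts the producer, waits for its post, then reads the payload. */
int receive_payload(void)
{
    pthread_t tid;
    int rc, seen;

    rc = sem_init(&ready, 0, 0);
    assert(rc == 0);
    rc = pthread_create(&tid, NULL, producer, NULL);
    assert(rc == 0);

    /* Wait for the post, retrying if a signal interrupts the wait. */
    do {
        rc = sem_wait(&ready);
    } while (rc == -1 && errno == EINTR);
    assert(rc == 0);

    /* This read is ordered after the write only by the post/wait pair;
       the join below comes too late to order it. */
    seen = payload;

    rc = pthread_join(tid, NULL);
    assert(rc == 0);
    rc = sem_destroy(&ready);
    assert(rc == 0);
    return seen;
}
```
]]></programlisting>
<para>In this sketch the thread creation/joinage events alone do not
order the write before the read, since the read happens before the
join. A program written this way is therefore the kind that may
produce extra race reports under
<option>--happens-before=threads</option> or
<option>--happens-before=none</option>.</para>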
				1138	<para>With <varname>--happens-before=threads</varname>, only
				1139	thread creation/joinage events are regarded as sources of
				1140	synchronisation.
				1141	</para>
				1142	<para>With <varname>--happens-before=none</varname>, no events
				1143	(apart, of course, from locking) are regarded as sources of
				1144	synchronisation.
				1145	</para>
				1146	<para>Changing this setting from the default will increase your
				1147	false-error rate but give little or no gain. The only advantage
				1148	is that <option>--happens-before=threads</option> and
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1149	<option>--happens-before=none</option> should make Helgrind
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1150	less and less sensitive to the scheduling of threads, and hence
				1151	the output more and more repeatable across runs.
				1152	</para>
				1153	</listitem>
				1154	</varlistentry>
				1155
				1156	<varlistentry id="opt.trace-addr" xreflabel="--trace-addr">
				1157	<term>
				1158	<option><![CDATA[--trace-addr=0xXXYYZZ
				1159	]]></option> and
				1160	<option><![CDATA[--trace-level=0|1|2 [default: 1]
				1161	]]></option>
				1162	</term>
				1163	<listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1164	<para>Requests that Helgrind produce a log of all state changes
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1165	to location 0xXXYYZZ. This can be helpful in tracking down
				1166	tricky races. <varname>--trace-level</varname> controls the
				1167	verbosity of the log. At the default setting (1), a one-line
				1168	summary is printed for each state change. At level 2 a
				1169	complete stack trace is printed for each state change.</para>
				1170	</listitem>
				1171	</varlistentry>
				1172
				1173	</variablelist>
				1174	<!-- end of xi:include in the manpage -->
				1175
				1176	<!-- start of xi:include in the manpage -->
				1177	<para>In addition, the following debugging options are available for
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1178	Helgrind:</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1179
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1180	<variablelist id="hg.debugopts.list">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1181
				1182	<varlistentry id="opt.trace-malloc" xreflabel="--trace-malloc">
				1183	<term>
				1184	<option><![CDATA[--trace-malloc=no|yes [no]
				1185	]]></option>
				1186	</term>
				1187	<listitem>
				1188	<para>Show all client malloc (etc) and free (etc) requests.</para>
				1189	</listitem>
				1190	</varlistentry>
				1191
				1192	<varlistentry id="opt.gen-vcg" xreflabel="--gen-vcg">
				1193	<term>
				1194	<option><![CDATA[--gen-vcg=no|yes|yes-w-vts [no]
				1195	]]></option>
				1196	</term>
				1197	<listitem>
				1198	<para>At exit, write to stderr a dump of the happens-before
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1199	graph computed by Helgrind, in a format suitable for the VCG
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1200	graph visualisation tool. A suitable command line is:</para>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1201	<para><computeroutput>valgrind --tool=helgrind
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1202	--gen-vcg=yes my_app 2>&amp;1
				1203	| grep xxxxxx | sed "s/xxxxxx//g"
				1204	| xvcg -</computeroutput></para>
				1205	<para>With <varname>--gen-vcg=yes</varname>, the basic
				1206	happens-before graph is shown. With
				1207	<varname>--gen-vcg=yes-w-vts</varname>, the vector timestamp
				1208	for each node is also shown.</para>
				1209	</listitem>
				1210	</varlistentry>
				1211
				1212	<varlistentry id="opt.cmp-race-err-addrs"
				1213	xreflabel="--cmp-race-err-addrs">
				1214	<term>
				1215	<option><![CDATA[--cmp-race-err-addrs=no|yes [no]
				1216	]]></option>
				1217	</term>
				1218	<listitem>
				1219	<para>Controls whether or not race (data) addresses should be
				1220	taken into account when removing duplicates of race errors.
				1221	With <varname>--cmp-race-err-addrs=no</varname>, two otherwise
				1222	identical race errors will be considered to be the same even if
				1223	their race addresses differ. With
				1224	<varname>--cmp-race-err-addrs=yes</varname> they will be
				1225	considered different. This is provided to help make certain
				1226	regression tests work reliably.</para>
				1227	</listitem>
				1228	</varlistentry>
				1229
				1230	<varlistentry id="opt.tc-sanity-flags" xreflabel="--tc-sanity-flags">
				1231	<term>
				1232	<option><![CDATA[--tc-sanity-flags=<XXXXX> (X = 0|1) [00000]
				1233	]]></option>
				1234	</term>
				1235	<listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1236	<para>Run extensive sanity checks on Helgrind's internal
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1237	data structures at events defined by the bitstring, as
				1238	follows:</para>
				1239	<para><computeroutput>10000 </computeroutput>after changes to
				1240	the lock order acquisition graph</para>
				1241	<para><computeroutput>01000 </computeroutput>after every client
				1242	memory access (NB: not currently used)</para>
				1243	<para><computeroutput>00100 </computeroutput>after every client
				1244	memory range permission setting of 256 bytes or greater</para>
				1245	<para><computeroutput>00010 </computeroutput>after every client
				1246	lock or unlock event</para>
				1247	<para><computeroutput>00001 </computeroutput>after every client
				1248	thread creation or joinage event</para>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1249	<para>Note these will make Helgrind run very slowly, often to
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1250	the point of being completely unusable.</para>
				1251	</listitem>
				1252	</varlistentry>
				1253
				1254	</variablelist>
				1255	<!-- end of xi:include in the manpage -->
				1256
				1257
				1258	</sect1>
				1259
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1260	<sect1 id="hg-manual.todolist" xreflabel="To Do List">
				1261	<title>A To-Do List for Helgrind</title>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1262
				1263	<para>The following is a list of loose ends which should be tidied up
				1264	some time.</para>
				1265
				1266	<itemizedlist>
				1267	<listitem><para>Track which mutexes are associated with which
				1268	condition variables, and emit a warning if this becomes
				1269	inconsistent.</para>
				1270	</listitem>
				1271	<listitem><para>For lock order errors, print the complete lock
				1272	cycle, rather than only doing so for size-2 cycles as at
				1273	present.</para>
				1274	</listitem>
				1275	<listitem><para>Document the VALGRIND_HG_CLEAN_MEMORY client
				1276	request.</para>
				1277	</listitem>
				1278	<listitem><para>Possibly a client request to forcibly transfer
				1279	ownership of memory from one thread to another. Requires further
				1280	consideration.</para>
				1281	</listitem>
				1282	<listitem><para>Add a new client request that marks an address range
				1283	as being "shared-modified with empty lockset" (the error state),
				1284	and describe how to use it.</para>
				1285	</listitem>
				1286	<listitem><para>Document races caused by gcc's thread-unsafe code
				1287	generation for speculative stores. In the interim see
				1288	<computeroutput>http://gcc.gnu.org/ml/gcc/2007-10/msg00266.html
				1289	</computeroutput>
				1290	and <computeroutput>http://lkml.org/lkml/2007/10/24/673</computeroutput>.
				1291	</para>
				1292	</listitem>
				1293	<listitem><para>Don't update the lock-order graph, and don't check
				1294	for errors, when a "try"-style lock operation happens (e.g.
				1295	pthread_mutex_trylock). Such calls do not add any real
				1296	restrictions to the locking order, since they can always fail to
				1297	acquire the lock, resulting in the caller going off and doing Plan
				1298	B (presumably it will have a Plan B). Doing such checks could
				1299	generate false lock-order errors and confuse users.</para>
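<para>A minimal sketch of the kind of code this item has in mind
follows. All names (<computeroutput>lock_a</computeroutput>,
<computeroutput>lock_b</computeroutput>,
<computeroutput>update_with_fallback</computeroutput>,
<computeroutput>current_total</computeroutput>) are hypothetical. One
function takes the locks in the order A then B with blocking calls;
the other holds B and only tries A, falling back to Plan B if the
attempt fails. Because the try never blocks, this pair cannot
deadlock, yet the apparent B-then-A order is what a lock-order checker
might flag.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
static int shared_total;   /* accessed only with both locks held */

/* Holds B and merely tries A. Returns 1 if the update was applied,
   0 if A was busy and the caller should retry later (Plan B). */
int update_with_fallback(int delta)
{
    int applied = 0;
    int rc = pthread_mutex_lock(&lock_b);
    assert(rc == 0);

    if (pthread_mutex_trylock(&lock_a) == 0) {
        shared_total += delta;
        applied = 1;
        rc = pthread_mutex_unlock(&lock_a);
        assert(rc == 0);
    }
    /* Plan B: A was busy, so skip the update rather than block on it. */

    rc = pthread_mutex_unlock(&lock_b);
    assert(rc == 0);
    (void)rc;
    return applied;
}

/* Takes the locks in the usual order, A then B, with blocking calls. */
int current_total(void)
{
    int value;
    int rc = pthread_mutex_lock(&lock_a);
    assert(rc == 0);
    rc = pthread_mutex_lock(&lock_b);
    assert(rc == 0);

    value = shared_total;

    rc = pthread_mutex_unlock(&lock_b);
    assert(rc == 0);
    rc = pthread_mutex_unlock(&lock_a);
    assert(rc == 0);
    (void)rc;
    return value;
}
```
]]></programlisting>
<para>With no contention,
<computeroutput>update_with_fallback(5)</computeroutput> succeeds and
a later <computeroutput>current_total()</computeroutput> reflects the
update. Under contention the update is simply skipped, which is the
Plan B the item above refers to.</para>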
				1300	</listitem>
				1301	<listitem><para>Performance can be very poor. Slowdowns on the
				1302	order of 100:1 are not unusual. There is quite some scope for
				1303	performance improvements, though.
				1304	</para>
				1305	</listitem>
				1306
				1307	</itemizedlist>
				1308
				1309	</sect1>
				1310
				1311	</chapter>