<?xml version="1.0"?> <!-- -*- sgml -*- -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"
[ <!ENTITY % vg-entities SYSTEM "../../docs/xml/vg-entities.xml"> %vg-entities; ]>


<chapter id="hg-manual" xreflabel="Helgrind: thread error detector">
<title>Helgrind: a thread error detector</title>

<para>To use this tool, you must specify
<option>--tool=helgrind</option> on the Valgrind
command line.</para>


<sect1 id="hg-manual.overview" xreflabel="Overview">
<title>Overview</title>

<para>Helgrind is a Valgrind tool for detecting synchronisation errors
in C, C++ and Fortran programs that use the POSIX pthreads
threading primitives.</para>

<para>The main abstractions in POSIX pthreads are: a set of threads
sharing a common address space, thread creation, thread joining,
thread exit, mutexes (locks), condition variables (inter-thread event
notifications), reader-writer locks, spinlocks, semaphores and
barriers.</para>

<para>Helgrind can detect three classes of errors, which are discussed
in detail in the next three sections:</para>

<orderedlist>
<listitem>
<para><link linkend="hg-manual.api-checks">
Misuses of the POSIX pthreads API.</link></para>
</listitem>
<listitem>
<para><link linkend="hg-manual.lock-orders">
Potential deadlocks arising from lock
ordering problems.</link></para>
</listitem>
<listitem>
<para><link linkend="hg-manual.data-races">
Data races -- accessing memory without adequate locking
or synchronisation</link>.
</para>
</listitem>
</orderedlist>

<para>Problems like these often result in unreproducible,
timing-dependent crashes, deadlocks and other misbehaviour, and
can be difficult to find by other means.</para>

<para>Helgrind is aware of all the pthread abstractions and tracks
their effects as accurately as it can. On x86 and amd64 platforms, it
understands and partially handles implicit locking arising from the
use of the LOCK instruction prefix. On PowerPC/POWER and ARM
platforms, it partially handles implicit locking arising from
load-linked and store-conditional instruction pairs.
</para>

<para>Helgrind works best when your application uses only the POSIX
pthreads API. However, if you want to use custom threading
primitives, you can describe their behaviour to Helgrind using the
<varname>ANNOTATE_*</varname> macros defined
in <varname>helgrind.h</varname>.</para>

<para>Helgrind also provides <xref linkend="manual-core.xtree"/> memory
profiling using the command-line
option <computeroutput>--xtree-memory</computeroutput> and the monitor command
<computeroutput>xtmemory</computeroutput>.</para>



<para>Following the three sections on detected errors is a section containing
<link linkend="hg-manual.effective-use">
hints and tips on how to get the best out of Helgrind.</link>
</para>

<para>Then there is a
<link linkend="hg-manual.options">summary of command-line
options.</link>
</para>

<para>Finally, there is
<link linkend="hg-manual.todolist">a brief summary of areas in which Helgrind
could be improved.</link>
</para>

</sect1>




<sect1 id="hg-manual.api-checks" xreflabel="API Checks">
<title>Detected errors: Misuses of the POSIX pthreads API</title>

<para>Helgrind intercepts calls to many POSIX pthreads functions, and
is therefore able to report on various common problems. Although
these are unglamorous errors, their presence can lead to undefined
program behaviour and hard-to-find bugs later on. The detected errors
are:</para>

<itemizedlist>
<listitem><para>unlocking an invalid mutex</para></listitem>
<listitem><para>unlocking a not-locked mutex</para></listitem>
<listitem><para>unlocking a mutex held by a different
thread</para></listitem>
<listitem><para>destroying an invalid or a locked mutex</para></listitem>
<listitem><para>recursively locking a non-recursive mutex</para></listitem>
<listitem><para>deallocation of memory that contains a
locked mutex</para></listitem>
<listitem><para>passing mutex arguments to functions expecting
reader-writer lock arguments, and vice
versa</para></listitem>
<listitem><para>when a POSIX pthread function fails with an
error code that must be handled</para></listitem>
<listitem><para>when a thread exits whilst still holding locked
locks</para></listitem>
<listitem><para>calling <function>pthread_cond_wait</function>
with a not-locked mutex, an invalid mutex,
or one locked by a different
thread</para></listitem>
<listitem><para>inconsistent bindings between condition
variables and their associated mutexes</para></listitem>
<listitem><para>invalid or duplicate initialisation of a pthread
barrier</para></listitem>
<listitem><para>initialisation of a pthread barrier on which threads
are still waiting</para></listitem>
<listitem><para>destruction of a pthread barrier object which was
never initialised, or on which threads are still
waiting</para></listitem>
<listitem><para>waiting on an uninitialised pthread
barrier</para></listitem>
<listitem><para>for all of the pthreads functions that Helgrind
intercepts, an error is reported, along with a stack
trace, if the system threading library routine returns
an error code, even if Helgrind itself detected no
error</para></listitem>
</itemizedlist>

<para>Checks pertaining to the validity of mutexes are generally also
performed for reader-writer locks.</para>

<para>Various kinds of this-can't-possibly-happen events are also
reported. These usually indicate bugs in the system threading
library.</para>

<para>Reported errors always contain a primary stack trace indicating
where the error was detected. They may also contain auxiliary stack
traces giving additional information. In particular, most errors
relating to mutexes will also tell you where that mutex first came to
Helgrind's attention (the "<computeroutput>was first observed
at</computeroutput>" part), so you have a chance of figuring out which
mutex it is referring to. For example:</para>

<programlisting><![CDATA[
Thread #1 unlocked a not-locked lock at 0x7FEFFFA90
   at 0x4C2408D: pthread_mutex_unlock (hg_intercepts.c:492)
   by 0x40073A: nearly_main (tc09_bad_unlock.c:27)
   by 0x40079B: main (tc09_bad_unlock.c:50)
  Lock at 0x7FEFFFA90 was first observed
   at 0x4C25D01: pthread_mutex_init (hg_intercepts.c:326)
   by 0x40071F: nearly_main (tc09_bad_unlock.c:23)
   by 0x40079B: main (tc09_bad_unlock.c:50)
]]></programlisting>

<para>Helgrind has a way of summarising thread identities, as
you see here with the text "<computeroutput>Thread
#1</computeroutput>". This is so that it can speak about threads and
sets of threads without overwhelming you with details. See
<link linkend="hg-manual.data-races.errmsgs">below</link>
for more information on interpreting error messages.</para>
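
<para>As a minimal sketch of one of the errors listed above, unlocking a
not-locked mutex, consider the following program. It is not the
<computeroutput>tc09_bad_unlock.c</computeroutput> test case, and the name
<function>unlock_without_lock</function> is made up for illustration. The
mutex is given the <computeroutput>PTHREAD_MUTEX_ERRORCHECK</computeroutput>
type, an assumption made here so that the bad unlock fails with
<computeroutput>EPERM</computeroutput> instead of having undefined
behaviour. Run under Helgrind, a program of this shape would be expected
to draw an "unlocked a not-locked lock" report similar to the one shown
earlier.</para>

<programlisting><![CDATA[
#include <errno.h>
#include <pthread.h>
#include <stdio.h>

/* Unlocks an error-checking mutex that was never locked and returns
   whatever pthread_mutex_unlock reports. */
int unlock_without_lock(void) {
    pthread_mutexattr_t attr;
    pthread_mutex_t mx;
    int r;

    pthread_mutexattr_init(&attr);
    /* Assumption for this sketch: an error-checking mutex, so the bad
       unlock returns an error code rather than being undefined. */
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&mx, &attr);
    pthread_mutexattr_destroy(&attr);

    r = pthread_mutex_unlock(&mx);   /* API misuse: mx is not locked */

    pthread_mutex_destroy(&mx);
    return r;
}

int main(void) {
    int r = unlock_without_lock();
    printf("pthread_mutex_unlock returned %d (%s)\n", r,
           r == EPERM ? "EPERM" : "not EPERM");
    return 0;
}
]]></programlisting>

<para>Because the library call itself fails here, Helgrind would also be
expected to report the returned error code, as described in the last item
of the list above.</para>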

</sect1>




<sect1 id="hg-manual.lock-orders" xreflabel="Lock Orders">
<title>Detected errors: Inconsistent Lock Orderings</title>

<para>In this section, and in general, to "acquire" a lock simply
means to lock that lock, and to "release" a lock means to unlock
it.</para>

<para>Helgrind monitors the order in which threads acquire locks.
This allows it to detect potential deadlocks which could arise from
the formation of cycles of locks. Detecting such inconsistencies is
useful because, whilst actual deadlocks are fairly obvious, potential
deadlocks may never be discovered during testing and could later lead
to hard-to-diagnose in-service failures.</para>

<para>The simplest example of such a problem is as
follows.</para>

<itemizedlist>
<listitem><para>Imagine some shared resource R, which, for whatever
reason, is guarded by two locks, L1 and L2, which must both be held
when R is accessed.</para>
</listitem>
<listitem><para>Suppose a thread acquires L1, then L2, and proceeds
to access R. The implication of this is that all threads in the
program must acquire the two locks in the order first L1 then L2.
Not doing so risks deadlock.</para>
</listitem>
<listitem><para>The deadlock could happen if two threads -- call them
T1 and T2 -- both want to access R. Suppose T1 acquires L1 first,
and T2 acquires L2 first. Then T1 tries to acquire L2, and T2 tries
to acquire L1, but those locks are both already held. So T1 and T2
become deadlocked.</para>
</listitem>
</itemizedlist>

<para>Helgrind builds a directed graph indicating the order in which
locks have been acquired in the past. When a thread acquires a new
lock, the graph is updated, and then checked to see if it now contains
a cycle. The presence of a cycle indicates a potential deadlock involving
the locks in the cycle.</para>

<para>In general, Helgrind will choose two locks involved in the cycle
and show you how their acquisition ordering has become inconsistent.
It does this by showing the program points that first defined the
ordering, and the program points which later violated it. Here is a
simple example involving just two locks:</para>

<programlisting><![CDATA[
Thread #1: lock order "0x7FF0006D0 before 0x7FF0006A0" violated

Observed (incorrect) order is: acquisition of lock at 0x7FF0006A0
   at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
   by 0x400825: main (tc13_laog1.c:23)

 followed by a later acquisition of lock at 0x7FF0006D0
   at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
   by 0x400853: main (tc13_laog1.c:24)

Required order was established by acquisition of lock at 0x7FF0006D0
   at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
   by 0x40076D: main (tc13_laog1.c:17)

 followed by a later acquisition of lock at 0x7FF0006A0
   at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
   by 0x40079B: main (tc13_laog1.c:18)
]]></programlisting>
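
<para>A program of roughly the following shape would be expected to
provoke a report like this. It is a sketch only, not the text of
<computeroutput>tc13_laog1.c</computeroutput>; the names
<varname>mx1</varname>, <varname>mx2</varname>,
<function>lock_pair</function> and
<function>run_inconsistent_order</function> are illustrative. Only one
thread is involved, so the program cannot actually deadlock when run, yet
the two acquisition orders it exhibits form a cycle in the lock order
graph.</para>

<programlisting><![CDATA[
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t mx1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t mx2 = PTHREAD_MUTEX_INITIALIZER;

/* Acquire 'first', then 'second', then release both.
   Returns 0 on success, or the pthread error code. */
static int lock_pair(pthread_mutex_t *first, pthread_mutex_t *second) {
    int r = pthread_mutex_lock(first);
    if (r != 0)
        return r;
    r = pthread_mutex_lock(second);
    if (r != 0) {
        pthread_mutex_unlock(first);
        return r;
    }
    pthread_mutex_unlock(second);
    pthread_mutex_unlock(first);
    return 0;
}

int run_inconsistent_order(void) {
    int r = lock_pair(&mx1, &mx2);   /* establishes "mx1 before mx2" */
    if (r != 0)
        return r;
    return lock_pair(&mx2, &mx1);    /* later acquisition violates it */
}

int main(void) {
    printf("result: %d\n", run_inconsistent_order());
    return 0;
}
]]></programlisting>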

<para>When there are more than two locks in the cycle, the error is
equally serious. However, at present Helgrind does not show the locks
involved, sometimes because that information is not available, but
also so as to avoid flooding you with information. For example, a
naive implementation of the famous Dining Philosophers problem
involves a cycle of five locks
(see <computeroutput>helgrind/tests/tc14_laog_dinphils.c</computeroutput>).
In this case Helgrind has detected that all 5 philosophers could
simultaneously pick up their left fork and then deadlock whilst
waiting to pick up their right forks.</para>

<programlisting><![CDATA[
Thread #6: lock order "0x80499A0 before 0x8049A00" violated

Observed (incorrect) order is: acquisition of lock at 0x8049A00
   at 0x40085BC: pthread_mutex_lock (hg_intercepts.c:495)
   by 0x80485B4: dine (tc14_laog_dinphils.c:18)
   by 0x400BDA4: mythread_wrapper (hg_intercepts.c:219)
   by 0x39B924: start_thread (pthread_create.c:297)
   by 0x2F107D: clone (clone.S:130)

 followed by a later acquisition of lock at 0x80499A0
   at 0x40085BC: pthread_mutex_lock (hg_intercepts.c:495)
   by 0x80485CD: dine (tc14_laog_dinphils.c:19)
   by 0x400BDA4: mythread_wrapper (hg_intercepts.c:219)
   by 0x39B924: start_thread (pthread_create.c:297)
   by 0x2F107D: clone (clone.S:130)
]]></programlisting>
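
<para>One conventional way to remove such a cycle, offered here as an
illustrative sketch rather than as part of the test case, is to impose a
single global order on the locks. In the version below each philosopher
always takes the lower-numbered fork first, so no cycle of acquisitions
can form. All names (<varname>fork_mx</varname>,
<varname>meals</varname>, <function>run_dinner</function>) and the
iteration count are assumptions made for the example.</para>

<programlisting><![CDATA[
#include <pthread.h>
#include <stdio.h>

#define N 5          /* number of philosophers and forks */
#define ROUNDS 100   /* meals per philosopher */

static pthread_mutex_t fork_mx[N];
static int meals[N];

static void *dine(void *arg) {
    int i = *(int *)arg;
    int left = i;
    int right = (i + 1) % N;
    /* Global lock order: always take the lower-numbered fork first. */
    int first = left < right ? left : right;
    int second = left < right ? right : left;

    for (int round = 0; round < ROUNDS; round++) {
        pthread_mutex_lock(&fork_mx[first]);
        pthread_mutex_lock(&fork_mx[second]);
        meals[i]++;   /* only philosopher i writes meals[i] */
        pthread_mutex_unlock(&fork_mx[second]);
        pthread_mutex_unlock(&fork_mx[first]);
    }
    return NULL;
}

/* Runs the dinner and returns the total number of meals eaten. */
int run_dinner(void) {
    pthread_t th[N];
    int id[N];
    int total = 0;

    for (int i = 0; i < N; i++) {
        pthread_mutex_init(&fork_mx[i], NULL);
        meals[i] = 0;
    }
    for (int i = 0; i < N; i++) {
        id[i] = i;
        pthread_create(&th[i], NULL, dine, &id[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(th[i], NULL);
    for (int i = 0; i < N; i++) {
        total += meals[i];   /* safe: ordered after the joins */
        pthread_mutex_destroy(&fork_mx[i]);
    }
    return total;
}

int main(void) {
    printf("total meals: %d\n", run_dinner());
    return 0;
}
]]></programlisting>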

</sect1>




<sect1 id="hg-manual.data-races" xreflabel="Data Races">
<title>Detected errors: Data Races</title>

<para>A data race happens, or could happen, when two threads access a
shared memory location without using suitable locks or other
synchronisation to ensure single-threaded access. Such missing
locking can cause obscure timing-dependent bugs. Ensuring programs
are race-free is one of the central difficulties of threaded
programming.</para>

<para>Reliably detecting races is a difficult problem, and most
of Helgrind's internals are devoted to dealing with it.
We begin with a simple example.</para>


<sect2 id="hg-manual.data-races.example" xreflabel="Simple Race">
<title>A Simple Data Race</title>

<para>About the simplest possible example of a race is as follows. In
this program, it is impossible to know what the value
of <computeroutput>var</computeroutput> is at the end of the program.
Is it 2? Or 1?</para>

<programlisting><![CDATA[
#include <pthread.h>

int var = 0;

void* child_fn ( void* arg ) {
   var++; /* Unprotected relative to parent */ /* this is line 6 */
   return NULL;
}

int main ( void ) {
   pthread_t child;
   pthread_create(&child, NULL, child_fn, NULL);
   var++; /* Unprotected relative to child */ /* this is line 13 */
   pthread_join(child, NULL);
   return 0;
}
]]></programlisting>

<para>The problem is there is nothing to
stop <varname>var</varname> being updated simultaneously
by both threads. A correct program would
protect <varname>var</varname> with a lock of type
<function>pthread_mutex_t</function>, which is acquired
before each access and released afterwards. Helgrind's output for
this program is:</para>

<programlisting><![CDATA[
Thread #1 is the program's root thread

Thread #2 was created
   at 0x511C08E: clone (in /lib64/libc-2.8.so)
   by 0x4E333A4: do_clone (in /lib64/libpthread-2.8.so)
   by 0x4E33A30: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.8.so)
   by 0x4C299D4: pthread_create@* (hg_intercepts.c:214)
   by 0x400605: main (simple_race.c:12)

Possible data race during read of size 4 at 0x601038 by thread #1
Locks held: none
   at 0x400606: main (simple_race.c:13)

This conflicts with a previous write of size 4 by thread #2
Locks held: none
   at 0x4005DC: child_fn (simple_race.c:6)
   by 0x4C29AFF: mythread_wrapper (hg_intercepts.c:194)
   by 0x4E3403F: start_thread (in /lib64/libpthread-2.8.so)
   by 0x511C0CC: clone (in /lib64/libc-2.8.so)

Location 0x601038 is 0 bytes inside global var "var"
declared at simple_race.c:3
]]></programlisting>

<para>This is quite a lot of detail for an apparently simple error.
The main error message is the part beginning
"<computeroutput>Possible data race during read</computeroutput>".
It says there is a race as
a result of a read of size 4 (bytes), at 0x601038, which is the
address of <computeroutput>var</computeroutput>, happening in
function <computeroutput>main</computeroutput> at line 13 in the
program.</para>

<para>Two important parts of the message are:</para>

<itemizedlist>
<listitem>
<para>Helgrind shows two stack traces for the error, not one. By
definition, a race involves two different threads accessing the
same location in such a way that the result depends on the relative
speeds of the two threads.</para>
<para>
The first stack trace follows the text "<computeroutput>Possible
data race during read of size 4 ...</computeroutput>" and the
second trace follows the text "<computeroutput>This conflicts with
a previous write of size 4 ...</computeroutput>". Helgrind is
usually able to show both accesses involved in a race. At least
one of these will be a write (since two concurrent, unsynchronised
reads are harmless), and they will of course be from different
threads.</para>
<para>By examining your program at the two locations, you should be
able to get at least some idea of what the root cause of the
problem is. For each location, Helgrind shows the set of locks
held at the time of the access. This often makes it clear which
thread, if any, failed to take a required lock. In this example
neither thread holds a lock during the access.</para>
</listitem>
<listitem>
<para>For races which occur on global or stack variables, Helgrind
tries to identify the name and defining point of the variable.
Hence the text "<computeroutput>Location 0x601038 is 0 bytes inside
global var "var" declared at simple_race.c:3</computeroutput>".</para>
<para>Showing names of stack and global variables carries no
run-time overhead once Helgrind has your program up and running.
However, it does require Helgrind to spend considerable extra time
and memory at program startup to read the relevant debug info.
Hence this facility is disabled by default. To enable it, you need
to give the <varname>--read-var-info=yes</varname> option to
Helgrind.</para>
</listitem>
</itemizedlist>
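
<para>The fix suggested earlier, protecting
<varname>var</varname> with a mutex that is acquired before each access
and released afterwards, can be sketched as follows. The names
<varname>var_mx</varname> and <function>run_locked</function> are
assumptions introduced for this example.</para>

<programlisting><![CDATA[
#include <pthread.h>
#include <stdio.h>

static int var = 0;
static pthread_mutex_t var_mx = PTHREAD_MUTEX_INITIALIZER;

static void *child_fn(void *arg) {
    (void)arg;
    pthread_mutex_lock(&var_mx);
    var++;                       /* protected relative to parent */
    pthread_mutex_unlock(&var_mx);
    return NULL;
}

/* Runs parent and child increments under the lock; returns var. */
int run_locked(void) {
    pthread_t child;

    var = 0;                     /* no other thread exists yet */
    pthread_create(&child, NULL, child_fn, NULL);
    pthread_mutex_lock(&var_mx);
    var++;                       /* protected relative to child */
    pthread_mutex_unlock(&var_mx);
    pthread_join(child, NULL);
    return var;                  /* ordered after the child by the join */
}

int main(void) {
    printf("var = %d\n", run_locked());
    return 0;
}
]]></programlisting>

<para>Assuming the source is saved under a hypothetical name such as
<computeroutput>simple_race_fixed.c</computeroutput>, it could be built
with <computeroutput>gcc -g -pthread simple_race_fixed.c -o
simple_race_fixed</computeroutput> and checked with
<computeroutput>valgrind --tool=helgrind --read-var-info=yes
./simple_race_fixed</computeroutput>. With both increments made under
the same lock, no race on <varname>var</varname> should be
reported.</para>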

<para>The following section explains Helgrind's race detection
algorithm in more detail.</para>

</sect2>



<sect2 id="hg-manual.data-races.algorithm" xreflabel="DR Algorithm">
<title>Helgrind's Race Detection Algorithm</title>

<para>Most programmers think about threaded programming in terms of
the basic functionality provided by the threading library (POSIX
Pthreads): thread creation, thread joining, locks, condition
variables, semaphores and barriers.</para>

<para>The effect of using these functions is to impose
constraints upon the order in which memory accesses can
happen. This implied ordering is generally known as the
"happens-before relation". Once you understand the happens-before
relation, it is easy to see how Helgrind finds races in your code.
Fortunately, the happens-before relation is itself easy to understand,
and is by itself a useful tool for reasoning about the behaviour of
parallel programs. We now introduce it using a simple example.</para>

<para>Consider first the following buggy program:</para>

<programlisting><![CDATA[
Parent thread:                         Child thread:

int var;

// create child thread
pthread_create(...)
var = 20;                              var = 10;
                                       exit

// wait for child
pthread_join(...)
printf("%d\n", var);
]]></programlisting>

<para>The parent thread creates a child. Both then write different
values to some variable <computeroutput>var</computeroutput>, and the
parent then waits for the child to exit.</para>

<para>What is the value of <computeroutput>var</computeroutput> at the
end of the program, 10 or 20? We don't know. The program is
considered buggy (it has a race) because the final value
of <computeroutput>var</computeroutput> depends on the relative rates
of progress of the parent and child threads. If the parent is fast
and the child is slow, then the child's assignment may happen later,
so the final value will be 10; and vice versa if the child is faster
than the parent.</para>

<para>The relative rates of progress of the parent and child are not
something the programmer can control, and will often change from run
to run. They depend on factors such as the load on the machine, what
else is running, the kernel's scheduling strategy, and many others.</para>

<para>The obvious fix is to use a lock to
protect <computeroutput>var</computeroutput>. It is however
instructive to consider a somewhat more abstract solution, which is to
send a message from one thread to the other:</para>

<programlisting><![CDATA[
Parent thread:                         Child thread:

int var;

// create child thread
pthread_create(...)
var = 20;
// send message to child
                                       // wait for message to arrive
                                       var = 10;
                                       exit

// wait for child
pthread_join(...)
printf("%d\n", var);
]]></programlisting>

<para>Now the program reliably prints "10", regardless of the speed of
the threads. Why? Because the child's assignment cannot happen until
after it receives the message. And the message is not sent until
after the parent's assignment is done.</para>

<para>The message transmission creates a "happens-before" dependency
between the two assignments: <computeroutput>var = 20;</computeroutput>
must now happen-before <computeroutput>var = 10;</computeroutput>.
And so there is no longer a race
on <computeroutput>var</computeroutput>.
</para>

<para>Note that it's not significant that the parent sends a message
to the child. Sending a message from the child (after its assignment)
to the parent (before its assignment) would also fix the problem, causing
the program to reliably print "20".</para>
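
<para>The abstract "message" in the fixed program can be realised with
ordinary pthreads primitives. The sketch below is one possible
realisation, not the only one: the parent sets a flag under a mutex and
signals a condition variable, and the child waits for that flag before
making its assignment. The names <varname>sent</varname>,
<varname>mx</varname>, <varname>cv</varname> and
<function>run_message_passing</function> are assumptions made for the
example.</para>

<programlisting><![CDATA[
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t mx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int sent = 0;   /* the "message": set once by the parent */
static int var;

static void *child_fn(void *arg) {
    (void)arg;
    /* wait for message to arrive */
    pthread_mutex_lock(&mx);
    while (!sent)
        pthread_cond_wait(&cv, &mx);
    pthread_mutex_unlock(&mx);

    var = 10;          /* ordered after the parent's var = 20 */
    return NULL;
}

/* Runs the fixed program and returns the final value of var. */
int run_message_passing(void) {
    pthread_t child;

    sent = 0;
    pthread_create(&child, NULL, child_fn, NULL);
    var = 20;

    /* send message to child */
    pthread_mutex_lock(&mx);
    sent = 1;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mx);

    /* wait for child */
    pthread_join(child, NULL);
    return var;
}

int main(void) {
    printf("%d\n", run_message_passing());
    return 0;
}
]]></programlisting>

<para>Whichever thread reaches the mutex first, the child cannot leave
its wait loop until the parent has set <varname>sent</varname>, which the
parent does only after its own assignment.</para>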
				499
				500	<para>Helgrind's algorithm is (conceptually) very simple. It monitors all
				501	accesses to memory locations. If a location -- in this example,
				502	<computeroutput>var</computeroutput>,
				503	is accessed by two different threads, Helgrind checks to see if the
sewardj	1a620d5	2008-12-23 11:13:07 +0000	[diff] [blame]	504	two accesses are ordered by the happens-before relation. If so,
sewardj	7c76839	2008-12-21 21:17:24 +0000	[diff] [blame]	505	that's fine; if not, it reports a race.</para>
				506
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	507	<para>It is important to understand that the happens-before relation
sewardj	7c76839	2008-12-21 21:17:24 +0000	[diff] [blame]	508	creates only a partial ordering, not a total ordering. An example of
				509	a total ordering is comparison of numbers: for any two numbers
				510	<computeroutput>x</computeroutput> and
				511	<computeroutput>y</computeroutput>, either
				512	<computeroutput>x</computeroutput> is less than, equal to, or greater
				513	than
				514	<computeroutput>y</computeroutput>. A partial ordering is like a
sewardj	c6a1cd1	2008-12-22 00:39:41 +0000	[diff] [blame]	515	total ordering, but it can also express the concept that two elements
sewardj	7c76839	2008-12-21 21:17:24 +0000	[diff] [blame]	516	are neither equal, less nor greater, but merely unordered with respect
				517	to each other.</para>
				518
				519	<para>In the fixed example above, we say that
				520	<computeroutput>var = 20;</computeroutput> "happens-before"
				521	<computeroutput>var = 10;</computeroutput>. But in the original
				522	version, they are unordered: we cannot say that either happens-before
				523	the other.</para>
				524
				525	<para>What does it mean to say that two accesses from different
sewardj	1a620d5	2008-12-23 11:13:07 +0000	[diff] [blame]	526	threads are ordered by the happens-before relation? It means that
sewardj	7c76839	2008-12-21 21:17:24 +0000	[diff] [blame]	527	there is some chain of inter-thread synchronisation operations which
				528	cause those accesses to happen in a particular order, irrespective of
				529	the actual rates of progress of the individual threads. This is a
				530	required property for a reliable threaded program, which is why
				531	Helgrind checks for it.</para>
				532
				533	<para>The happens-before relations created by standard threading
				534	primitives are as follows:</para>
				535
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	536	<itemizedlist>
sewardj	7c76839	2008-12-21 21:17:24 +0000	[diff] [blame]	537	<listitem><para>When a mutex is unlocked by thread T1 and later (or
				538	immediately) locked by thread T2, then the memory accesses in T1
				539	prior to the unlock must happen-before those in T2 after it acquires
				540	the lock.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	541	</listitem>
sewardj	7c76839	2008-12-21 21:17:24 +0000	[diff] [blame]	542	<listitem><para>The same idea applies to reader-writer locks,
				543	although with some complication so as to allow correct handling of
				544	reads vs writes.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	545	</listitem>
sewardj	c6a1cd1	2008-12-22 00:39:41 +0000	[diff] [blame]	546	<listitem><para>When a condition variable (CV) is signalled on by
				547	thread T1 and some other thread T2 is thereby released from a wait
				548	on the same CV, then the memory accesses in T1 prior to the
				549	signalling must happen-before those in T2 after it returns from the
				550	wait. If no thread was waiting on the CV then there is no
sewardj	7c76839	2008-12-21 21:17:24 +0000	[diff] [blame]	551	effect.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	552	</listitem>
sewardj	c6a1cd1	2008-12-22 00:39:41 +0000	[diff] [blame]	553	<listitem><para>If instead T1 broadcasts on a CV, then all of the
sewardj	7c76839	2008-12-21 21:17:24 +0000	[diff] [blame]	554	waiting threads, rather than just one of them, acquire a
				555	happens-before dependency on the broadcasting thread at the point it
				556	did the broadcast.</para>
				557	</listitem>
				558	<listitem><para>A thread T2 that continues after completing sem_wait
				559	on a semaphore that thread T1 posts on, acquires a happens-before
				560	dependence on the posting thread, a bit like dependencies caused
				561	by mutex unlock-lock pairs. However, since a semaphore can be posted
				562	on many times, it is unspecified from which of the post calls the
				563	wait call gets its happens-before dependency.</para>
				564	</listitem>
				565	<listitem><para>For a group of threads T1 .. Tn which arrive at a
				566	barrier and then move on, each thread after the call has a
				567	happens-after dependency from all threads before the
				568	barrier.</para>
				569	</listitem>
				570	<listitem><para>A newly-created child thread acquires an initial
				571	happens-after dependency on the point where its parent created it.
				572	That is, all memory accesses performed by the parent prior to
				573	creating the child are regarded as happening-before all the accesses
				574	of the child.</para>
				575	</listitem>
				576	<listitem><para>Similarly, when an exiting thread is reaped via a
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	577	call to <function>pthread_join</function>, once the call returns, the
				578	reaping thread acquires a happens-after dependency relative to all memory
				579	accesses made by the exiting thread.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	580	</listitem>
				581	</itemizedlist>
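
<para>As an illustration of the last two rules, here is a minimal
sketch in C. The names <computeroutput>var</computeroutput>,
<computeroutput>child_fn</computeroutput> and
<computeroutput>create_join_demo</computeroutput> are hypothetical and
chosen only for this example. No lock is used: the accesses are ordered
purely by <function>pthread_create</function> and
<function>pthread_join</function>, so under the rules listed above they
should not be regarded as racing.</para>

<programlisting><![CDATA[
```c
#include <pthread.h>
#include <stddef.h>

static int var;   /* shared location, deliberately not protected by a lock */

/* Child: runs entirely after the parent's pthread_create call. */
static void *child_fn(void *arg)
{
    (void)arg;
    var = var + 10;   /* sees the parent's "var = 1" */
    return NULL;
}

/* Hypothetical driver; returns the final value of var, or -1 on error. */
int create_join_demo(void)
{
    pthread_t child;

    var = 1;                      /* happens-before everything in the child */
    if (pthread_create(&child, NULL, child_fn, NULL) != 0)
        return -1;

    /* The parent must not touch var here: such an access would be
       unordered with respect to the child's update. */

    if (pthread_join(child, NULL) != 0)
        return -1;
    return var;                   /* happens-after everything in the child */
}
```
]]></programlisting>

<para>Calling <computeroutput>create_join_demo()</computeroutput> from
<computeroutput>main</computeroutput> is enough to try this out. Adding
a read of <computeroutput>var</computeroutput> between the two calls
would make the accesses unordered again.</para>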
				582
sewardj	c6a1cd1	2008-12-22 00:39:41 +0000	[diff] [blame]	583	<para>In summary: Helgrind intercepts the above listed events, and builds a
sewardj	7c76839	2008-12-21 21:17:24 +0000	[diff] [blame]	584	directed acyclic graph representing the collective happens-before
				585	dependencies. It also monitors all memory accesses.</para>
				586
				587	<para>If a location is accessed by two different threads, but Helgrind
				588	cannot find any path through the happens-before graph from one access
sewardj	c6a1cd1	2008-12-22 00:39:41 +0000	[diff] [blame]	589	to the other, then it reports a race.</para>
sewardj	7c76839	2008-12-21 21:17:24 +0000	[diff] [blame]	590
				591	<para>There are a couple of caveats:</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	592
				593	<itemizedlist>
sewardj	c6a1cd1	2008-12-22 00:39:41 +0000	[diff] [blame]	594	<listitem><para>Helgrind doesn't check for a race in the case where
				595	both accesses are reads. That would be silly, since concurrent
				596	reads are harmless.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	597	</listitem>
sewardj	7c76839	2008-12-21 21:17:24 +0000	[diff] [blame]	598	<listitem><para>Two accesses are considered to be ordered by the
				599	happens-before dependency even through arbitrarily long chains of
				600	synchronisation events. For example, if T1 accesses some location
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	601	L, and then signals T2 (via <function>pthread_cond_signal</function>), which later
				602	signals T3 in the same way, which then accesses L, then
				603	a suitable happens-before dependency exists between the first and second
sewardj	7c76839	2008-12-21 21:17:24 +0000	[diff] [blame]	604	accesses, even though it involves two different inter-thread
				605	synchronisation events.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	606	</listitem>
				607	</itemizedlist>
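
<para>The chaining idea can be sketched as follows. This is only an
illustration, and it deliberately swaps in different primitives from
the ones named above: the first link of the chain is a semaphore
post/wait pair and the second is thread exit followed by
<function>pthread_join</function>. All names
(<computeroutput>L</computeroutput>,
<computeroutput>t1_to_t2</computeroutput>,
<computeroutput>chain_demo</computeroutput>) are hypothetical.</para>

<programlisting><![CDATA[
```c
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

static int   L;          /* the shared location */
static sem_t t1_to_t2;   /* first link of the chain: T1 -> T2 */

/* T1: writes L, then posts the semaphore. */
static void *t1_fn(void *arg)
{
    (void)arg;
    L = 42;
    sem_post(&t1_to_t2);
    return NULL;
}

/* T2: never touches L; it only forwards the ordering by waiting on the
   semaphore and then exiting. */
static void *t2_fn(void *arg)
{
    (void)arg;
    sem_wait(&t1_to_t2);
    return NULL;
}

/* The calling thread plays the role of T3.  Returns the value it reads
   from L, or -1 on error. */
int chain_demo(void)
{
    pthread_t t1, t2;
    int seen;

    L = 0;
    if (sem_init(&t1_to_t2, 0, 0) != 0)
        return -1;
    if (pthread_create(&t2, NULL, t2_fn, NULL) != 0)
        return -1;
    if (pthread_create(&t1, NULL, t1_fn, NULL) != 0)
        return -1;

    pthread_join(t2, NULL);   /* second link: T2 exit -> join in T3 */
    seen = L;                 /* ordered after T1's write via T2 */

    pthread_join(t1, NULL);
    sem_destroy(&t1_to_t2);
    return seen;
}
```
]]></programlisting>

<para>Note that the read of <computeroutput>L</computeroutput> takes
place before T1 itself is joined. The only ordering between the write
and the read is the two-step chain through T2.</para>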
				608
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	609	</sect2>
				610
				611
				612
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	613	<sect2 id="hg-manual.data-races.errmsgs" xreflabel="Race Error Messages">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	614	<title>Interpreting Race Error Messages</title>
				615
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	616	<para>Helgrind's race detection algorithm collects a lot of
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	617	information, and tries to present it in a helpful way when a race is
				618	detected. Here's an example:</para>
				619
				620	<programlisting><![CDATA[
				621	Thread #2 was created
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	622	at 0x511C08E: clone (in /lib64/libc-2.8.so)
				623	by 0x4E333A4: do_clone (in /lib64/libpthread-2.8.so)
				624	by 0x4E33A30: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.8.so)
				625	by 0x4C299D4: pthread_create@* (hg_intercepts.c:214)
				626	by 0x4008F2: main (tc21_pthonce.c:86)
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	627
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	628	Thread #3 was created
				629	at 0x511C08E: clone (in /lib64/libc-2.8.so)
				630	by 0x4E333A4: do_clone (in /lib64/libpthread-2.8.so)
				631	by 0x4E33A30: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.8.so)
				632	by 0x4C299D4: pthread_create@* (hg_intercepts.c:214)
				633	by 0x4008F2: main (tc21_pthonce.c:86)
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	634
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	635	Possible data race during read of size 4 at 0x601070 by thread #3
sewardj	70ceabc	2011-06-24 18:23:42 +0000	[diff] [blame]	636	Locks held: none
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	637	at 0x40087A: child (tc21_pthonce.c:74)
				638	by 0x4C29AFF: mythread_wrapper (hg_intercepts.c:194)
				639	by 0x4E3403F: start_thread (in /lib64/libpthread-2.8.so)
				640	by 0x511C0CC: clone (in /lib64/libc-2.8.so)
sewardj	70ceabc	2011-06-24 18:23:42 +0000	[diff] [blame]	641
				642	This conflicts with a previous write of size 4 by thread #2
				643	Locks held: none
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	644	at 0x400883: child (tc21_pthonce.c:74)
				645	by 0x4C29AFF: mythread_wrapper (hg_intercepts.c:194)
				646	by 0x4E3403F: start_thread (in /lib64/libpthread-2.8.so)
				647	by 0x511C0CC: clone (in /lib64/libc-2.8.so)
sewardj	70ceabc	2011-06-24 18:23:42 +0000	[diff] [blame]	648
				649	Location 0x601070 is 0 bytes inside local var "unprotected2"
				650	declared at tc21_pthonce.c:51, in frame #0 of thread 3
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	651	]]></programlisting>
				652
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	653	<para>Helgrind first announces the creation points of any threads
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	654	referenced in the error message. This is so it can speak concisely
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	655	about threads without repeatedly printing their creation point call
				656	stacks. Each thread is only ever announced once, the first time it
				657	appears in any Helgrind error message.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	658
				659	<para>The main error message begins at the text
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	660	"<computeroutput>Possible data race during read</computeroutput>". At
				661	the start is information you would expect to see -- address and size
				662	of the racing access, whether a read or a write, and the call stack at
				663	the point it was detected.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	664
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	665	<para>A second call stack is presented starting at the text
				666	"<computeroutput>This conflicts with a previous
				667	write</computeroutput>". This shows a previous access which also
				668	accessed the stated address, and which is believed to be racing
philippe	5c165b2	2012-07-20 23:40:35 +0000	[diff] [blame]	669	against the access in the first call stack. Note that this second
				670	call stack is limited to a maximum of 8 entries to limit the
				671	memory usage.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	672
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	673	<para>Finally, Helgrind may attempt to give a description of the
				674	raced-on address in source level terms. In this example, it
				675	identifies it as a local variable, shows its name, declaration point,
				676	and in which frame (of the first call stack) it lives. Note that this
				677	information is only shown when <varname>--read-var-info=yes</varname>
				678	is specified on the command line. That's because reading the DWARF3
				679	debug information in enough detail to capture variable type and
				680	location information makes Helgrind much slower at startup, and also
				681	requires considerable amounts of memory, for large programs.
				682	</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	683
sewardj	c6a1cd1	2008-12-22 00:39:41 +0000	[diff] [blame]	684	<para>Once you have your two call stacks, how do you find the root
				685	cause of the race?</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	686
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	687	<para>The first thing to do is examine the source locations referred
				688	to by each call stack. They should both show an access to the same
				689	location, or variable.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	690
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	691	<para>Now figure out how that location should have been made
				692	thread-safe:</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	693
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	694	<itemizedlist>
				695	<listitem><para>Perhaps the location was intended to be protected by
				696	a mutex? If so, you need to lock and unlock the mutex at both
				697	access points, even if one of the accesses is reported to be a read.
sewardj	70ceabc	2011-06-24 18:23:42 +0000	[diff] [blame]	698	Did you perhaps forget the locking at one or other of the accesses?
				699	To help you do this, Helgrind shows the set of locks held by each
				700	thread at the time it accessed the raced-on location.</para>
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	701	</listitem>
sewardj	c6a1cd1	2008-12-22 00:39:41 +0000	[diff] [blame]	702	<listitem><para>Alternatively, perhaps you intended to use some
				703	other scheme to make it safe, such as signalling on a condition
				704	variable. In all such cases, try to find a synchronisation event
				705	(or a chain thereof) which separates the earlier-observed access (as
				706	shown in the second call stack) from the later-observed access (as
				707	shown in the first call stack). In other words, try to find
				708	evidence that the earlier access "happens-before" the later access.
				709	See the previous subsection for an explanation of the happens-before
sewardj	1a620d5	2008-12-23 11:13:07 +0000	[diff] [blame]	710	relation.</para>
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	711	<para>
				712	The fact that Helgrind is reporting a race means it did not observe
sewardj	1a620d5	2008-12-23 11:13:07 +0000	[diff] [blame]	713	any happens-before relation between the two accesses. If
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	714	Helgrind is working correctly, it should also be the case that you
sewardj	1a620d5	2008-12-23 11:13:07 +0000	[diff] [blame]	715	cannot find any such relation, even on detailed inspection
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	716	of the source code. Hopefully, though, your inspection of the code
				717	will show where the missing synchronisation operation(s) should have
				718	been.</para>
				719	</listitem>
				720	</itemizedlist>
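
<para>For the mutex case, the intended fix usually looks like the
following sketch, in which the lock is taken around the read as well as
around the write. The names <computeroutput>counter</computeroutput>,
<computeroutput>counter_mx</computeroutput> and
<computeroutput>locked_access_demo</computeroutput> are hypothetical.
This is not the code of the test case shown in the error message
above.</para>

<programlisting><![CDATA[
```c
#include <pthread.h>
#include <stddef.h>

static int             counter;
static pthread_mutex_t counter_mx = PTHREAD_MUTEX_INITIALIZER;

/* Writer: updates counter while holding the mutex. */
static void *writer_fn(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&counter_mx);
    counter = 5;
    pthread_mutex_unlock(&counter_mx);
    return NULL;
}

/* Reader: takes the same mutex, even though it only reads. */
static void *reader_fn(void *arg)
{
    int *out = arg;
    pthread_mutex_lock(&counter_mx);
    *out = counter;
    pthread_mutex_unlock(&counter_mx);
    return NULL;
}

/* Hypothetical driver; returns the value seen by the reader, which is
   0 or 5 depending on which thread gets the lock first. */
int locked_access_demo(void)
{
    pthread_t writer, reader;
    int seen = -1;

    counter = 0;
    if (pthread_create(&writer, NULL, writer_fn, NULL) != 0)
        return -1;
    if (pthread_create(&reader, NULL, reader_fn, &seen) != 0) {
        pthread_join(writer, NULL);
        return -1;
    }
    pthread_join(writer, NULL);
    pthread_join(reader, NULL);
    return seen;
}
```
]]></programlisting>

<para>The result still depends on scheduling, but whichever thread
takes the lock second is ordered after the other by the unlock-lock
pair, so the two accesses are no longer unordered.</para>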
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	721
				722	</sect2>
				723
				724
				725	</sect1>
				726
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	727	<sect1 id="hg-manual.effective-use" xreflabel="Helgrind Effective Use">
				728	<title>Hints and Tips for Effective Use of Helgrind</title>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	729
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	730	<para>Helgrind can be very helpful in finding and resolving
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	731	threading-related problems. Like all sophisticated tools, it is most
				732	effective when you understand how to play to its strengths.</para>
				733
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	734	<para>Helgrind will be less effective when you merely throw an
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	735	existing threaded program at it and try to make sense of any reported
				736	errors. It will be more effective if you design threaded programs
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	737	from the start in a way that helps Helgrind verify correctness. The
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	738	same is true for finding memory errors with Memcheck, but applies more
				739	here, because thread checking is a harder problem. Consequently it is
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	740	much easier to write a correct program for which Helgrind falsely
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	741	reports (threading) errors than it is to write a correct program for
				742	which Memcheck falsely reports (memory) errors.</para>
				743
				744	<para>With that in mind, here are some tips, listed most important first,
				745	for getting reliable results and avoiding false errors. The first two
				746	are critical. Any violations of them will swamp you with huge numbers
				747	of false data-race errors.</para>
				748
				749
				750	<orderedlist>
				751
				752	<listitem>
				753	<para>Make sure your application, and all the libraries it uses,
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	754	use the POSIX threading primitives. Helgrind needs to be able to
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	755	see all events pertaining to thread creation, exit, locking and
sewardj	3387889	2007-11-17 09:43:25 +0000	[diff] [blame]	756	other synchronisation events. To do so it intercepts many POSIX
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	757	pthreads functions.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	758
				759	<para>Do not roll your own threading primitives (mutexes, etc)
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	760	from combinations of the Linux futex syscall, atomic counters, etc.
				761	These throw Helgrind's internal what's-going-on models
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	762	way off course and will give bogus results.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	763
				764	<para>Also, do not reimplement existing POSIX abstractions using
				765	other POSIX abstractions. For example, don't build your own
				766	semaphore routines or reader-writer locks from POSIX mutexes and
				767	condition variables. Instead use POSIX reader-writer locks and
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	768	semaphores directly, since Helgrind supports them directly.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	769
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	770	<para>Helgrind directly supports the following POSIX threading
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	771	abstractions: mutexes, reader-writer locks, condition variables
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	772	(but see below), semaphores and barriers. Currently spinlocks
				773	are not supported, although they could be in future.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	774
				775	<para>At the time of writing, the following popular Linux packages
				776	are known to implement their own threading primitives:</para>
				777
				778	<itemizedlist>
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	779	<listitem><para>Qt version 4.X. Qt 3.X is harmless in that it
				780	only uses POSIX pthreads primitives. Unfortunately Qt 4.X
				781	has its own implementation of mutexes (QMutex) and thread reaping.
				782	Helgrind 3.4.x contains direct support
				783	for Qt 4.X threading, which is experimental but is believed to
				784	work fairly well. A side effect of supporting Qt 4 directly is
				785	that Helgrind can be used to debug KDE4 applications. As this
				786	is an experimental feature, we would particularly appreciate
				787	feedback from folks who have used Helgrind to successfully debug
				788	Qt 4 and/or KDE4 applications.</para>
				789	</listitem>
				790	<listitem><para>Runtime support library for GNU OpenMP (part of
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	791	GCC), at least for GCC versions 4.2 and 4.3. The GNU OpenMP runtime
				792	library (<filename>libgomp.so</filename>) constructs its own
				793	synchronisation primitives using combinations of atomic memory
				794	instructions and the futex syscall, which causes total chaos in
				795	Helgrind since it cannot "see" those.</para>
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	796	<para>Fortunately, this can be solved using a configuration-time
njn	a331164	2009-08-10 01:29:14 +0000	[diff] [blame]	797	option (for GCC). Rebuild GCC from source, and configure using
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	798	<varname>--disable-linux-futex</varname>.
				799	This makes libgomp.so use the standard
				800	POSIX threading primitives instead. Note that this was tested
njn	7316df2	2009-08-04 01:16:01 +0000	[diff] [blame]	801	using GCC 4.2.3 and has not been re-tested using more recent GCC
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	802	versions. We would appreciate hearing about any successes or
				803	failures with more recent versions.</para>
				804	</listitem>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	805	</itemizedlist>
sewardj	70ceabc	2011-06-24 18:23:42 +0000	[diff] [blame]	806
				807	<para>If you must implement your own threading primitives, there
				808	are a set of client request macros
				809	in <computeroutput>helgrind.h</computeroutput> to help you
				810	describe your primitives to Helgrind. You should be able to
				811	mark up mutexes, condition variables, etc, without difficulty.
				812	</para>
				813	<para>
				814	It is also possible to mark up the effects of thread-safe
				815	reference counting using the
				816	<computeroutput>ANNOTATE_HAPPENS_BEFORE</computeroutput>,
				817	<computeroutput>ANNOTATE_HAPPENS_AFTER</computeroutput> and
				818	<computeroutput>ANNOTATE_HAPPENS_BEFORE_FORGET_ALL</computeroutput>
				819	macros. Thread-safe reference counting using an atomically
				820	incremented/decremented refcount variable causes Helgrind
				821	problems because a one-to-zero transition of the reference count
				822	means the accessing thread has exclusive ownership of the
				823	associated resource (normally, a C++ object) and can therefore
				824	access it (normally, to run its destructor) without locking.
				825	Helgrind doesn't understand this, and markup is essential to
				826	avoid false positives.
				827	</para>
				828
				829	<para>
				830	Here are recommended guidelines for marking up thread safe
				831	reference counting in C++. You only need to mark up your
				832	release methods -- the ones which decrement the reference count.
				833	Given a class like this:
				834	</para>
				835
				836	<programlisting><![CDATA[
				837	class MyClass {
				838	   unsigned int mRefCount;
				839	
				840	   void Release ( void ) {
				841	      unsigned int newCount = atomic_decrement(&mRefCount);
				842	      if (newCount == 0) {
				843	         delete this;
				844	      }
				845	   }
				846	};
				847	]]></programlisting>
				848
				849	<para>
				850	the release method should be marked up as follows:
				851	</para>
				852
				853	<programlisting><![CDATA[
				854	void Release ( void ) {
				855	   unsigned int newCount = atomic_decrement(&mRefCount);
				856	   if (newCount == 0) {
				857	      ANNOTATE_HAPPENS_AFTER(&mRefCount);
				858	      ANNOTATE_HAPPENS_BEFORE_FORGET_ALL(&mRefCount);
				859	      delete this;
				860	   } else {
				861	      ANNOTATE_HAPPENS_BEFORE(&mRefCount);
				862	   }
				863	}
				864	]]></programlisting>
				865
				866	<para>
				867	There are a number of complex, mostly-theoretical objections to
				868	this scheme. From a theoretical standpoint it appears to be
				869	impossible to devise a markup scheme which is completely correct
				870	in the sense of guaranteeing to remove all false races. The
				871	proposed scheme however works well in practice.
				872	</para>
				873
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	874	</listitem>
				875
				876	<listitem>
				877	<para>Avoid memory recycling. If you can't avoid it, you must
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	878	tell Helgrind what is going on via the
				879	<function>VALGRIND_HG_CLEAN_MEMORY</function> client request (in
				880	<computeroutput>helgrind.h</computeroutput>).</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	881
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	882	<para>Helgrind is aware of standard heap memory allocation and
				883	deallocation that occurs via
				884	<function>malloc</function>/<function>free</function>/<function>new</function>/<function>delete</function>
				885	and from entry and exit of stack frames. In particular, when memory is
				886	deallocated via <function>free</function>, <function>delete</function>,
				887	or function exit, Helgrind considers that memory clean, so when it is
				888	eventually reallocated, its history is irrelevant.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	889
				890	<para>However, it is common practice to implement memory recycling
				891	schemes. In these, memory to be freed is not handed to
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	892	<function>free</function>/<function>delete</function>, but instead put
				893	into a pool of free buffers to be handed out again as required. The
				894	problem is that Helgrind has no
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	895	way to know that such memory is logically no longer in use, and
				896	its history is irrelevant. Hence you must make that explicit,
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	897	using the <function>VALGRIND_HG_CLEAN_MEMORY</function> client request
				898	to specify the relevant address ranges. It's easiest to put these
				899	requests into the pool manager code, and use them either when memory is
				900	returned to the pool, or is allocated from it.</para>
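
<para>A minimal sketch of such a pool manager follows. The pool itself
(<computeroutput>pool_get</computeroutput>,
<computeroutput>pool_put</computeroutput>, the buffer sizes) is
hypothetical and kept as small as possible. The only Helgrind-specific
part is the <function>VALGRIND_HG_CLEAN_MEMORY</function> request,
issued here when a buffer is returned to the pool. The sketch assumes
the header is installed as
<computeroutput>valgrind/helgrind.h</computeroutput> and falls back to a
do-nothing macro when it is not found.</para>

<programlisting><![CDATA[
```c
#include <pthread.h>
#include <stddef.h>

/* Use the real client request if the Valgrind headers are available;
   otherwise define a no-op so that the sketch still compiles. */
#if defined(__has_include)
#  if __has_include(<valgrind/helgrind.h>)
#    include <valgrind/helgrind.h>
#  endif
#endif
#ifndef VALGRIND_HG_CLEAN_MEMORY
#  define VALGRIND_HG_CLEAN_MEMORY(start, len) ((void)(start), (void)(len))
#endif

#define POOL_BUFS 4
#define BUF_SIZE  64

static unsigned char storage[POOL_BUFS][BUF_SIZE];

/* LIFO free list, initially holding every buffer. */
static void *free_list[POOL_BUFS] = {
    storage[0], storage[1], storage[2], storage[3]
};
static int free_count = POOL_BUFS;
static pthread_mutex_t pool_mx = PTHREAD_MUTEX_INITIALIZER;

/* Hand out a buffer, or NULL if the pool is empty. */
void *pool_get(void)
{
    void *buf = NULL;
    pthread_mutex_lock(&pool_mx);
    if (free_count > 0)
        buf = free_list[--free_count];
    pthread_mutex_unlock(&pool_mx);
    return buf;
}

/* Return a buffer previously obtained from pool_get. */
void pool_put(void *buf)
{
    /* The buffer is logically dead from here on: ask Helgrind to
       forget its access history before it is handed out again. */
    VALGRIND_HG_CLEAN_MEMORY(buf, BUF_SIZE);

    pthread_mutex_lock(&pool_mx);
    free_list[free_count++] = buf;
    pthread_mutex_unlock(&pool_mx);
}
```
]]></programlisting>

<para>Issuing the request in
<computeroutput>pool_get</computeroutput> instead would work equally
well. What matters is that it happens somewhere between the last use by
the old owner and the first use by the new one.</para>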
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	901	</listitem>
				902
				903	<listitem>
				904	<para>Avoid POSIX condition variables. If you can, use POSIX
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	905	semaphores (<function>sem_t</function>, <function>sem_post</function>,
				906	<function>sem_wait</function>) to do inter-thread event signalling.
				907	Semaphores with an initial value of zero are particularly useful for
				908	this.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	909
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	910	<para>Helgrind only partially correctly handles POSIX condition
				911	variables. This is because Helgrind can see inter-thread
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	912	dependencies between a <function>pthread_cond_wait</function> call and a
				913	<function>pthread_cond_signal</function>/<function>pthread_cond_broadcast</function>
				914	call only if the waiting thread actually gets to the rendezvous first
				915	(so that it actually calls
				916	<function>pthread_cond_wait</function>). It can't see dependencies
				917	between the threads if the signaller arrives first. In the latter case,
				918	POSIX guidelines imply that the associated boolean condition still
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	919	provides an inter-thread synchronisation event, but one which is
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	920	invisible to Helgrind.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	921
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	922	<para>The result of Helgrind missing some inter-thread
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	923	synchronisation events is to cause it to report false positives.
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	924	</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	925
				926	<para>The root cause of this synchronisation lossage is
				927	particularly hard to understand, so an example is helpful. It was
				928	discussed at length by Arndt Muehlenfeld ("Runtime Race Detection
				929	in Multi-Threaded Programs", Dissertation, TU Graz, Austria). The
				930	canonical POSIX-recommended usage scheme for condition variables
				931	is as follows:</para>
				932
				933	<programlisting><![CDATA[
				934	b is a Boolean condition, which is False most of the time
				935	cv is a condition variable
				936	mx is its associated mutex
				937
				938	Signaller:                             Waiter:
				939	
				940	lock(mx)                               lock(mx)
				941	b = True                               while (b == False)
				942	signal(cv)                                wait(cv,mx)
				943	unlock(mx)                             unlock(mx)
				944	]]></programlisting>
				945
				946	<para>Assume <computeroutput>b</computeroutput> is False most of
				947	the time. If the waiter arrives at the rendezvous first, it
				948	enters its while-loop, waits for the signaller to signal, and
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	949	eventually proceeds. Helgrind sees the signal, notes the
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	950	dependency, and all is well.</para>
				951
				952	<para>If the signaller arrives
				953	first, <computeroutput>b</computeroutput> is set to True, and the
				954	signal disappears into nowhere. When the waiter later arrives, it
				955	does not enter its while-loop and simply carries on. But even in
				956	this case, the waiter code following the while-loop cannot execute
				957	until the signaller sets <computeroutput>b</computeroutput> to
				958	True. Hence there is still the same inter-thread dependency, but
				959	this time it is through an arbitrary in-memory condition, and
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	960	Helgrind cannot see it.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	961
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	962	<para>By comparison, Helgrind's detection of inter-thread
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	963	dependencies caused by semaphore operations is believed to be
				964	exactly correct.</para>
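
<para>For comparison, here is a sketch of the same signaller/waiter
rendezvous using a POSIX semaphore with an initial value of zero. The
names <computeroutput>payload</computeroutput>,
<computeroutput>event</computeroutput> and
<computeroutput>sem_event_demo</computeroutput> are hypothetical. Since
a post is recorded in the semaphore's count, it makes no difference
which thread reaches the rendezvous first.</para>

<programlisting><![CDATA[
```c
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

static int   payload;   /* data handed from signaller to waiter */
static sem_t event;     /* initial value 0: "nothing has happened yet" */

/* Signaller: publishes the data, then posts the event. */
static void *signaller_fn(void *arg)
{
    (void)arg;
    payload = 99;
    sem_post(&event);
    return NULL;
}

/* Waiter: blocks until the event has been posted, then reads the data. */
static void *waiter_fn(void *arg)
{
    int *out = arg;
    sem_wait(&event);
    *out = payload;
    return NULL;
}

/* Hypothetical driver; returns the value seen by the waiter, or -1. */
int sem_event_demo(void)
{
    pthread_t signaller, waiter;
    int seen = -1;

    payload = 0;
    if (sem_init(&event, 0, 0) != 0)
        return -1;
    if (pthread_create(&waiter, NULL, waiter_fn, &seen) != 0)
        return -1;
    if (pthread_create(&signaller, NULL, signaller_fn, NULL) != 0) {
        sem_post(&event);             /* release the waiter before bailing out */
        pthread_join(waiter, NULL);
        return -1;
    }
    pthread_join(signaller, NULL);
    pthread_join(waiter, NULL);
    sem_destroy(&event);
    return seen;
}
```
]]></programlisting>

<para>Unlike the condition-variable version, there is no separate
Boolean flag and no while-loop, so the dependency is carried by the
semaphore operations themselves.</para>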
				965
				966	<para>As far as I know, a solution to this problem that does not
				967	require source-level annotation of condition-variable wait loops
				968	is beyond the current state of the art.</para>
				969	</listitem>
				970
				971	<listitem>
				972	<para>Make sure you are using a supported Linux distribution. At
sewardj	5246990	2008-12-21 23:11:14 +0000	[diff] [blame]	973	present, Helgrind only properly supports glibc-2.3 or later. This
				974	in turn means we only support glibc's NPTL threading
				975	implementation. The old LinuxThreads implementation is not
				976	supported.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	977	</listitem>
				978
				979	<listitem>
philippe	9848690	2014-08-19 22:46:44 +0000	[diff] [blame]	980	<para>If your application is using thread local variables,
				981	Helgrind might report false positive race conditions on these
				982	variables, even though they are very probably race free. On Linux, you can
				983	use <option>--sim-hints=deactivate-pthread-stack-cache-via-hack</option>
				984	to avoid such false positive error messages
				985	(see <xref linkend="opt.sim-hints"/>).
				986	</para>
				987	</listitem>
				988
				989	<listitem>
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	990	<para>Round up all finished threads using
				991	<function>pthread_join</function>. Avoid
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	992	detaching threads: don't create threads in the detached state, and
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	993	don't call <function>pthread_detach</function> on existing threads.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	994
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	995	<para>Using <function>pthread_join</function> to round up finished
				996	threads provides a clear synchronisation point that both Helgrind and
				997	programmers can see. If you don't call
				998	<function>pthread_join</function> on a thread, Helgrind has no way to
				999	know when it finishes, relative to any
				1000	significant synchronisation points for other threads in the program. So
				1001	it assumes that the thread lingers indefinitely and can potentially
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1002	interfere indefinitely with the memory state of the program. It
				1003	has every right to assume that -- after all, it might really be
				1004	the case that, for scheduling reasons, the exiting thread did run
				1005	very slowly in the last stages of its life.</para>
				1006	</listitem>
				1007
				1008	<listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1009	<para>Perform thread debugging (with Helgrind) and memory
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1010	debugging (with Memcheck) together.</para>
				1011
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1012	<para>Helgrind tracks the state of memory in detail, and memory
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1013	management bugs in the application are liable to cause confusion.
				1014	In extreme cases, applications which do many invalid reads and
				1015	writes (particularly to freed memory) have been known to crash
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1016	Helgrind. So, ideally, you should make your application
				1017	Memcheck-clean before using Helgrind.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1018
				1019	<para>It may be impossible to make your application Memcheck-clean
				1020	unless you first remove threading bugs. In particular, it may be
				1021	difficult to remove all reads and writes to freed memory in
				1022	multithreaded C++ destructor sequences at program termination.
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1023	So, ideally, you should make your application Helgrind-clean
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1024	before using Memcheck.</para>
				1025
				1026	<para>Since this circularity is obviously unresolvable, at least
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1027	bear in mind that Memcheck and Helgrind are to some extent
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1028	complementary, and you may need to use them together.</para>
				1029	</listitem>
				1030
				1031	<listitem>
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	1032	<para>POSIX requires that implementations of standard I/O
				1033	(<function>printf</function>, <function>fprintf</function>,
				1034	<function>fwrite</function>, <function>fread</function>, etc) are thread
				1035	safe. Unfortunately GNU libc implements this by using internal locking
				1036	primitives that Helgrind is unable to intercept. Consequently Helgrind
				1037	generates many false race reports when you use these functions.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1038
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1039	<para>Helgrind attempts to hide these errors using the standard
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1040	Valgrind error-suppression mechanism. So, at least for simple
				1041	test cases, you don't see any. Nevertheless, some may slip
				1042	through. Just something to be aware of.</para>
				1043	</listitem>
				1044
				1045	<listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1046	<para>Helgrind's error checks do not work properly inside the
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1047	system threading library itself
				1048	(<computeroutput>libpthread.so</computeroutput>), and it usually
				1049	observes large numbers of (false) errors in there. Valgrind's
				1050	suppression system then filters these out, so you should not see
				1051	them.</para>
				1052
				1053	<para>If you see any race errors reported
				1054	where <computeroutput>libpthread.so</computeroutput> or
				1055	<computeroutput>ld.so</computeroutput> is the object associated
				1056	with the innermost stack frame, please file a bug report at
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	1057	<ulink url="&vg-url;">&vg-url;</ulink>.
				1058	</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1059	</listitem>
				1060
				1061	</orderedlist>
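<para>The advice above about rounding up finished threads with
<function>pthread_join</function> can be sketched as follows. This is a
minimal illustration only: the names <varname>worker</varname>,
<varname>run_workers</varname>, <varname>NUM_WORKERS</varname> and the
<varname>DEMO_MAIN</varname> build flag are hypothetical and do not come
from Helgrind or its test suite.</para>
<programlisting><![CDATA[
#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4   /* hypothetical worker count, for illustration */

static int results[NUM_WORKERS];

/* Each worker writes only to its own slot, so the workers do not
   race with each other. */
static void *worker(void *arg)
{
    int idx = *(int *)arg;
    results[idx] = idx * idx;
    return NULL;
}

/* Create joinable threads (default attributes), join every one of
   them, and only then read the results.  Returns the sum of the
   results, or -1 if thread creation or joining failed. */
int run_workers(void)
{
    pthread_t tids[NUM_WORKERS];
    int ids[NUM_WORKERS];
    int created = 0, failed = 0, sum = 0, i;

    for (i = 0; i < NUM_WORKERS; i++) {
        ids[i] = i;
        if (pthread_create(&tids[i], NULL, worker, &ids[i]) != 0) {
            failed = 1;
            break;
        }
        created++;
    }

    /* Each join is a synchronisation point: everything the joined
       thread did is ordered before the code that follows the join. */
    for (i = 0; i < created; i++)
        if (pthread_join(tids[i], NULL) != 0)
            failed = 1;

    if (failed)
        return -1;

    for (i = 0; i < NUM_WORKERS; i++)
        sum += results[i];
    return sum;
}

#ifdef DEMO_MAIN   /* hypothetical flag: build with -DDEMO_MAIN to run standalone */
int main(void)
{
    printf("sum = %d\n", run_workers());
    return 0;
}
#endif
]]></programlisting>
<para>Because no thread is detached and every created thread is joined
before <varname>results</varname> is read, the reads in the final loop are
ordered after the workers' writes in a way that both Helgrind and a human
reader can see.</para>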
				1062
				1063	</sect1>
				1064
				1065
				1066
				1067
njn	a331164	2009-08-10 01:29:14 +0000	[diff] [blame]	1068	<sect1 id="hg-manual.options" xreflabel="Helgrind Command-line Options">
				1069	<title>Helgrind Command-line Options</title>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1070
				1071	<para>The following end-user options are available:</para>
				1072
				1073	<!-- start of xi:include in the manpage -->
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1074	<variablelist id="hg.opts.list">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1075
sewardj	622fe49	2011-03-11 21:06:59 +0000	[diff] [blame]	1076	<varlistentry id="opt.free-is-write"
				1077	xreflabel="--free-is-write">
				1078	<term>
				1079	<option><![CDATA[--free-is-write=no|yes
				1080	[default: no] ]]></option>
				1081	</term>
				1082	<listitem>
				1083	<para>When enabled (not the default), Helgrind treats freeing of
				1084	heap memory as if the memory was written immediately before
				1085	the free. This exposes races where memory is referenced by
				1086	one thread, and freed by another, but there is no observable
				1087	synchronisation event to ensure that the reference happens
				1088	before the free.
				1089	</para>
				1090	<para>This functionality is new in Valgrind 3.7.0, and is
				1091	regarded as experimental. It is not enabled by default
				1092	because its interaction with custom memory allocators is not
				1093	well understood at present. User feedback is welcomed.
				1094	</para>
				1095	</listitem>
				1096	</varlistentry>
				1097
sewardj	c6a1cd1	2008-12-22 00:39:41 +0000	[diff] [blame]	1098	<varlistentry id="opt.track-lockorders"
				1099	xreflabel="--track-lockorders">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1100	<term>
sewardj	c6a1cd1	2008-12-22 00:39:41 +0000	[diff] [blame]	1101	<option><![CDATA[--track-lockorders=no|yes
				1102	[default: yes] ]]></option>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1103	</term>
				1104	<listitem>
sewardj	c6a1cd1	2008-12-22 00:39:41 +0000	[diff] [blame]	1105	<para>When enabled (the default), Helgrind performs lock order
				1106	consistency checking. For some buggy programs, the large number
				1107	of lock order errors reported can become annoying, particularly
				1108	if you're only interested in race errors. You may therefore find
				1109	it helpful to disable lock order checking.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1110	</listitem>
				1111	</varlistentry>
				1112
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	1113	<varlistentry id="opt.history-level"
				1114	xreflabel="--history-level">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1115	<term>
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	1116	<option><![CDATA[--history-level=none|approx|full
				1117	[default: full] ]]></option>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1118	</term>
				1119	<listitem>
sewardj	3d49844	2009-08-16 22:47:02 +0000	[diff] [blame]	1120	<para><option>--history-level=full</option> (the default) causes
				1121	Helgrind to collect enough information about "old" accesses that
				1122	it can produce two stack traces in a race report -- both the
				1123	stack trace for the current access, and the trace for the
philippe	5c165b2	2012-07-20 23:40:35 +0000	[diff] [blame]	1124	older, conflicting access. To limit memory usage, stack traces for
				1125	"old" accesses are limited to a maximum of 8 entries, even if the
				1126	<option>--num-callers</option> value is larger.</para>
sewardj	c6a1cd1	2008-12-22 00:39:41 +0000	[diff] [blame]	1127	<para>Collecting such information is expensive in both speed and
sewardj	3d49844	2009-08-16 22:47:02 +0000	[diff] [blame]	1128	memory, particularly for programs that do many inter-thread
				1129	synchronisation events (locks, unlocks, etc). Without such
				1130	information, it is more difficult to track down the root
				1131	causes of races. Nonetheless, you may not need it in
				1132	situations where you just want to check for the presence or
				1133	absence of races, for example, when doing regression testing
				1134	of a previously race-free program.</para>
				1135	<para><option>--history-level=none</option> is the opposite
				1136	extreme. It causes Helgrind not to collect any information
				1137	about previous accesses. This can be dramatically faster
				1138	than <option>--history-level=full</option>.</para>
				1139	<para><option>--history-level=approx</option> provides a
				1140	compromise between these two extremes. It causes Helgrind to
				1141	show a full trace for the later access, and approximate
				1142	information regarding the earlier access. This approximate
				1143	information consists of two stacks, and the earlier access is
				1144	guaranteed to have occurred somewhere between program points
				1145	denoted by the two stacks. This is not as useful as showing
				1146	the exact stack for the previous access
				1147	(as <option>--history-level=full</option> does), but it is
				1148	better than nothing, and it is almost as fast as
				1149	<option>--history-level=none</option>.</para>
sewardj	c6a1cd1	2008-12-22 00:39:41 +0000	[diff] [blame]	1150	</listitem>
				1151	</varlistentry>
				1152
				1153	<varlistentry id="opt.conflict-cache-size"
				1154	xreflabel="--conflict-cache-size">
				1155	<term>
				1156	<option><![CDATA[--conflict-cache-size=N
				1157	[default: 1000000] ]]></option>
				1158	</term>
				1159	<listitem>
sewardj	3d49844	2009-08-16 22:47:02 +0000	[diff] [blame]	1160	<para>This flag only has any effect
				1161	at <option>--history-level=full</option>.</para>
sewardj	c6a1cd1	2008-12-22 00:39:41 +0000	[diff] [blame]	1162	<para>Information about "old" conflicting accesses is stored in
				1163	a cache of limited size, with LRU-style management. This is
				1164	necessary because it isn't practical to store a stack trace
				1165	for every single memory access made by the program.
				1166	Historical information on not recently accessed locations is
				1167	periodically discarded, to free up space in the cache.</para>
njn	a331164	2009-08-10 01:29:14 +0000	[diff] [blame]	1168	<para>This option controls the size of the cache, in terms of the
sewardj	c6a1cd1	2008-12-22 00:39:41 +0000	[diff] [blame]	1169	number of different memory addresses for which
				1170	conflicting access information is stored. If you find that
				1171	Helgrind is showing race errors with only one stack instead of
				1172	the expected two stacks, try increasing this value.</para>
sewardj	3d49844	2009-08-16 22:47:02 +0000	[diff] [blame]	1173	<para>The minimum value is 10,000 and the maximum is 30,000,000
				1174	(thirty times the default value). Increasing the value by 1
sewardj	c6a1cd1	2008-12-22 00:39:41 +0000	[diff] [blame]	1175	increases Helgrind's memory requirement by very roughly 100
sewardj	3d49844	2009-08-16 22:47:02 +0000	[diff] [blame]	1176	bytes, so the maximum value will easily eat up three extra
sewardj	78bb7f6	2009-08-14 21:33:34 +0000	[diff] [blame]	1177	gigabytes or so of memory.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1178	</listitem>
				1179	</varlistentry>
				1180
sewardj	70ceabc	2011-06-24 18:23:42 +0000	[diff] [blame]	1181	<varlistentry id="opt.check-stack-refs"
				1182	xreflabel="--check-stack-refs">
				1183	<term>
				1184	<option><![CDATA[--check-stack-refs=no|yes
				1185	[default: yes] ]]></option>
				1186	</term>
				1187	<listitem>
				1188	<para>
				1189	By default Helgrind checks all data memory accesses made by your
				1190	program. This flag enables you to skip checking for accesses
				1191	to thread stacks (local variables). This can improve
				1192	performance, but comes at the cost of missing races on
				1193	stack-allocated data.
				1194	</para>
				1195	</listitem>
				1196	</varlistentry>
				1197
sewardj	8eb8bab	2015-07-21 14:44:28 +0000	[diff] [blame]	1198	<varlistentry id="opt.ignore-thread-creation"
				1199	xreflabel="--ignore-thread-creation">
				1200	<term>
				1201	<option><![CDATA[--ignore-thread-creation=<yes|no>
				1202	[default: no]]]></option>
				1203	</term>
				1204	<listitem>
				1205	<para>
				1206	Controls whether all activities during thread creation should be
				1207	ignored. By default enabled only on Solaris.
				1208	Solaris provides higher throughput, parallelism and scalability than
				1209	other operating systems, at the cost of more fine-grained locking
				1210	activity. For example, when a thread is created under
				1211	glibc, just one big lock is used for all thread setup. Solaris libc,
				1212	by contrast, uses several fine-grained locks, and the creator thread resumes its
				1213	activities as soon as possible, leaving for example the stack and TLS setup
				1214	sequence to the created thread.
				1215	This situation confuses Helgrind, which assumes there is some false
				1216	ordering in place between creator and created thread, and therefore many
				1217	types of race conditions in the application would not be reported.
				1218	To prevent such false ordering, this command line option is set to
				1219	<computeroutput>yes</computeroutput> by default on Solaris.
				1220	All activity (loads, stores, client requests) is therefore ignored
				1221	during:</para>
				1222	<itemizedlist>
				1223	<listitem>
				1224	<para>
				1225	pthread_create() call in the creator thread
				1226	</para>
				1227	</listitem>
				1228	<listitem>
				1229	<para>
				1230	thread creation phase (stack and TLS setup) in the created thread
				1231	</para>
				1232	</listitem>
				1233	</itemizedlist>
				1234	<para>
				1235	Also, new memory allocated during thread creation is untracked;
				1236	that is, race reporting is suppressed there. DRD does the same thing
				1237	implicitly. This is necessary because Solaris libc caches many objects
				1238	and reuses them for different threads and that confuses
				1239	Helgrind.</para>
				1240	</listitem>
				1241	</varlistentry>
				1242
sewardj	70ceabc	2011-06-24 18:23:42 +0000	[diff] [blame]	1243
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1244	</variablelist>
				1245	<!-- end of xi:include in the manpage -->
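<para>As a rough worked example of the memory estimate given for
<option>--conflict-cache-size</option> above, the following sketch
multiplies the cache size by the "very roughly 100 bytes" per entry
quoted in the text. The function name, the constants' names and the use
of -1 for out-of-range values are assumptions made for this illustration;
they are not part of Helgrind, and the result is only as accurate as the
100-byte figure.</para>
<programlisting><![CDATA[
#include <stdio.h>

/* Limits and per-entry cost as quoted in the option description. */
#define CACHE_MIN        10000LL
#define CACHE_MAX        30000000LL
#define BYTES_PER_ENTRY  100LL      /* "very roughly 100 bytes" */

/* Approximate extra memory, in bytes, for a given cache size N.
   Returns -1 if N lies outside the documented range (this sketch's
   own convention, not Helgrind behaviour). */
long long approx_conflict_cache_bytes(long long n)
{
    if (n < CACHE_MIN || n > CACHE_MAX)
        return -1;
    return n * BYTES_PER_ENTRY;
}

#ifdef DEMO_MAIN   /* hypothetical flag: build with -DDEMO_MAIN to run standalone */
int main(void)
{
    printf("default: about %lld bytes\n",
           approx_conflict_cache_bytes(1000000LL));
    printf("maximum: about %lld bytes\n",
           approx_conflict_cache_bytes(CACHE_MAX));
    return 0;
}
#endif
]]></programlisting>
<para>With the default of 1,000,000 this gives on the order of 100 million
bytes, and with the maximum of 30,000,000 on the order of 3,000 million
bytes, which matches the "three extra gigabytes or so" mentioned
above.</para>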
				1246
				1247	<!-- start of xi:include in the manpage -->
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	1248	<!-- commented out, because we don't document debugging options in the
				1249	manual. Nb: all the double-dashes below had a space inserted in them
				1250	to avoid problems with premature closing of this comment.
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1251	<para>In addition, the following debugging options are available for
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1252	Helgrind:</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1253
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1254	<variablelist id="hg.debugopts.list">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1255
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	1256	<varlistentry id="opt.trace-malloc" xreflabel="- -trace-malloc">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1257	<term>
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	1258	<option><![CDATA[- -trace-malloc=no|yes [no]
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1259	]]></option>
				1260	</term>
				1261	<listitem>
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	1262	<para>Show all client <function>malloc</function> (etc) and
				1263	<function>free</function> (etc) requests.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1264	</listitem>
				1265	</varlistentry>
				1266
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1267	<varlistentry id="opt.cmp-race-err-addrs"
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	1268	xreflabel="- -cmp-race-err-addrs">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1269	<term>
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	1270	<option><![CDATA[- -cmp-race-err-addrs=no|yes [no]
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1271	]]></option>
				1272	</term>
				1273	<listitem>
				1274	<para>Controls whether or not race (data) addresses should be
				1275	taken into account when removing duplicates of race errors.
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	1276	With <varname>- -cmp-race-err-addrs=no</varname>, two otherwise
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1277	identical race errors will be considered to be the same even if
				1278	their race addresses differ.
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	1279	With <varname>- -cmp-race-err-addrs=yes</varname> they will be
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1280	considered different. This is provided to help make certain
				1281	regression tests work reliably.</para>
				1282	</listitem>
				1283	</varlistentry>
				1284
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	1285	<varlistentry id="opt.hg-sanity-flags" xreflabel="- -hg-sanity-flags">
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1286	<term>
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	1287	<option><![CDATA[- -hg-sanity-flags=<XXXXXX> (X = 0|1) [000000]
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1288	]]></option>
				1289	</term>
				1290	<listitem>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1291	<para>Run extensive sanity checks on Helgrind's internal
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1292	data structures at events defined by the bitstring, as
				1293	follows:</para>
sewardj	11e352f	2007-11-30 11:11:02 +0000	[diff] [blame]	1294	<para><computeroutput>010000 </computeroutput>after changes to
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1295	the lock order acquisition graph</para>
sewardj	11e352f	2007-11-30 11:11:02 +0000	[diff] [blame]	1296	<para><computeroutput>001000 </computeroutput>after every client
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1297	memory access (NB: not currently used)</para>
sewardj	11e352f	2007-11-30 11:11:02 +0000	[diff] [blame]	1298	<para><computeroutput>000100 </computeroutput>after every client
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1299	memory range permission setting of 256 bytes or greater</para>
sewardj	11e352f	2007-11-30 11:11:02 +0000	[diff] [blame]	1300	<para><computeroutput>000010 </computeroutput>after every client
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1301	lock or unlock event</para>
sewardj	11e352f	2007-11-30 11:11:02 +0000	[diff] [blame]	1302	<para><computeroutput>000001 </computeroutput>after every client
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1303	thread creation or joinage event</para>
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1304	<para>Note these will make Helgrind run very slowly, often to
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1305	the point of being completely unusable.</para>
				1306	</listitem>
				1307	</varlistentry>
				1308
				1309	</variablelist>
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	1310	-->
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1311	<!-- end of xi:include in the manpage -->
				1312
				1313
				1314	</sect1>
				1315
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	1316
philippe	f577434	2014-05-03 11:12:50 +0000	[diff] [blame]	1317	<sect1 id="hg-manual.monitor-commands" xreflabel="Helgrind Monitor Commands">
				1318	<title>Helgrind Monitor Commands</title>
				1319	<para>The Helgrind tool provides monitor commands handled by Valgrind's
				1320	built-in gdbserver (see <xref linkend="manual-core-adv.gdbserver-commandhandling"/>).
				1321	</para>
				1322	<itemizedlist>
				1323	<listitem>
philippe	328d662	2015-05-25 17:24:27 +0000	[diff] [blame]	1324	<para><varname>info locks [lock_addr]</varname> shows the list of locks
				1325	and their status. If <varname>lock_addr</varname> is given, only shows
				1326	the lock located at this address. </para>
philippe	f577434	2014-05-03 11:12:50 +0000	[diff] [blame]	1327	<para>
philippe	328d662	2015-05-25 17:24:27 +0000	[diff] [blame]	1328	In the following example, helgrind knows about one lock. This
				1329	lock is located at the guest address <varname>ga
				1330	0x8049a20</varname>. The lock kind is <varname>rdwr</varname>
				1331	indicating a reader-writer lock. Other possible lock kinds
				1332	are <varname>nonRec</varname> (simple mutex, non recursive)
philippe	f577434	2014-05-03 11:12:50 +0000	[diff] [blame]	1333	and <varname>mbRec</varname> (simple mutex, possibly recursive).
philippe	328d662	2015-05-25 17:24:27 +0000	[diff] [blame]	1334	The lock kind is then followed by the list of threads holding the
				1335	lock. In the example below, <varname>R1:thread #6 tid 3</varname>
				1336	indicates that the helgrind thread #6 has acquired (once, as the
				1337	counter following the letter R is one) the lock in read mode. The
				1338	helgrind thread number is incremented for each started thread. The
				1339	presence of 'tid 3' indicates that the thread #6 has not exited
				1340	yet and is the valgrind tid 3. If a thread has terminated, then
				1341	this is indicated with 'tid (exited)'.
philippe	f577434	2014-05-03 11:12:50 +0000	[diff] [blame]	1342	</para>
				1343	<programlisting><![CDATA[
				1344	(gdb) monitor info locks
				1345	Lock ga 0x8049a20 {
				1346	kind rdwr
				1347	{ R1:thread #6 tid 3 }
				1348	}
				1349	(gdb)
				1350	]]></programlisting>
				1351
philippe	328d662	2015-05-25 17:24:27 +0000	[diff] [blame]	1352	<para> If you give the option <varname>--read-var-info=yes</varname>,
				1353	then more information will be provided about the lock location, such as
				1354	the global variable or the heap block that contains the lock:
philippe	f577434	2014-05-03 11:12:50 +0000	[diff] [blame]	1355	</para>
				1356	<programlisting><![CDATA[
				1357	Lock ga 0x8049a20 {
philippe	07c0852	2014-05-14 20:39:27 +0000	[diff] [blame]	1358	Location 0x8049a20 is 0 bytes inside global var "s_rwlock"
				1359	declared at rwlock_race.c:17
philippe	f577434	2014-05-03 11:12:50 +0000	[diff] [blame]	1360	kind rdwr
				1361	{ R1:thread #3 tid 3 }
				1362	}
				1363	]]></programlisting>
				1364
				1365	</listitem>
				1366
philippe	328d662	2015-05-25 17:24:27 +0000	[diff] [blame]	1367	<listitem>
				1368	<para><varname>accesshistory &lt;addr&gt; [&lt;len&gt;]</varname>
				1369	shows the access history recorded for &lt;len&gt; (default 1) bytes
				1370	starting at &lt;addr&gt;. For each recorded access that overlaps
				1371	with the given range, <varname>accesshistory</varname> shows the operation
				1372	type (read or write), the address and size read or written, the helgrind
				1373	thread nr/valgrind tid number that did the operation and the locks held
				1374	by the thread at the time of the operation.
				1375	The oldest access is shown first, the most recent access is shown last.
				1376	</para>
				1377	<para>
				1378	In the following example, we first see a recorded write of 4 bytes by
				1379	thread #7 that has modified the given 2-byte range.
				1380	The second recorded write is the most recent one: thread #9
				1381	modified the same 2 bytes as part of a 4-byte write operation.
				1382	The list of locks held by each thread at the time of the write operation
				1383	is also shown.
				1384	</para>
				1385	<programlisting><![CDATA[
				1386	(gdb) monitor accesshistory 0x8049D8A 2
				1387	write of size 4 at 0x8049D88 by thread #7 tid 3
				1388	==6319== Locks held: 2, at address 0x8049D8C (and 1 that can't be shown)
				1389	==6319== at 0x804865F: child_fn1 (locked_vs_unlocked2.c:29)
				1390	==6319== by 0x400AE61: mythread_wrapper (hg_intercepts.c:234)
				1391	==6319== by 0x39B924: start_thread (pthread_create.c:297)
				1392	==6319== by 0x2F107D: clone (clone.S:130)
				1393
				1394	write of size 4 at 0x8049D88 by thread #9 tid 2
				1395	==6319== Locks held: 2, at addresses 0x8049DA4 0x8049DD4
				1396	==6319== at 0x804877B: child_fn2 (locked_vs_unlocked2.c:45)
				1397	==6319== by 0x400AE61: mythread_wrapper (hg_intercepts.c:234)
				1398	==6319== by 0x39B924: start_thread (pthread_create.c:297)
				1399	==6319== by 0x2F107D: clone (clone.S:130)
				1400
				1401	]]></programlisting>
				1402
				1403	</listitem>
				1404
philippe	f577434	2014-05-03 11:12:50 +0000	[diff] [blame]	1405	</itemizedlist>
				1406
				1407	</sect1>
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	1408
				1409	<sect1 id="hg-manual.client-requests" xreflabel="Helgrind Client Requests">
				1410	<title>Helgrind Client Requests</title>
				1411
				1412	<para>The following client requests are defined in
				1413	<filename>helgrind.h</filename>. See that file for exact details of their
				1414	arguments.</para>
				1415
				1416	<itemizedlist>
				1417
				1418	<listitem>
sewardj	3d49844	2009-08-16 22:47:02 +0000	[diff] [blame]	1419	<para><function>VALGRIND_HG_CLEAN_MEMORY</function></para>
				1420	<para>This makes Helgrind forget everything it knows about a
				1421	specified memory range. This is particularly useful for memory
				1422	allocators that wish to recycle memory.</para>
				1423	</listitem>
				1424	<listitem>
				1425	<para><function>ANNOTATE_HAPPENS_BEFORE</function></para>
				1426	</listitem>
				1427	<listitem>
				1428	<para><function>ANNOTATE_HAPPENS_AFTER</function></para>
				1429	</listitem>
				1430	<listitem>
				1431	<para><function>ANNOTATE_NEW_MEMORY</function></para>
				1432	</listitem>
				1433	<listitem>
				1434	<para><function>ANNOTATE_RWLOCK_CREATE</function></para>
				1435	</listitem>
				1436	<listitem>
				1437	<para><function>ANNOTATE_RWLOCK_DESTROY</function></para>
				1438	</listitem>
				1439	<listitem>
				1440	<para><function>ANNOTATE_RWLOCK_ACQUIRED</function></para>
				1441	</listitem>
				1442	<listitem>
				1443	<para><function>ANNOTATE_RWLOCK_RELEASED</function></para>
				1444	<para>These are used to describe to Helgrind the behaviour of
				1445	custom (non-POSIX) synchronisation primitives, which it otherwise
				1446	has no way to understand. See comments
				1447	in <filename>helgrind.h</filename> for further
				1448	documentation.</para>
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	1449	</listitem>
				1450
				1451	</itemizedlist>
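<para>To illustrate how the happens-before annotations listed above might
be applied, here is a minimal sketch of a hand-rolled flag handoff between
two threads. It assumes the header is installed as
<filename>valgrind/helgrind.h</filename> and falls back to no-op macros
when it is not found, so that the example still builds. The names
<varname>payload</varname>, <varname>ready</varname>,
<varname>run_handoff</varname> and the <varname>DEMO_MAIN</varname> build
flag are hypothetical. Consult the comments in
<filename>helgrind.h</filename> for the authoritative description of the
macros.</para>
<programlisting><![CDATA[
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

/* Use the real macros when the Valgrind headers are available;
   otherwise define no-op stand-ins (an assumption of this sketch). */
#if defined(__has_include)
#  if __has_include(<valgrind/helgrind.h>)
#    include <valgrind/helgrind.h>
#    define HAVE_HELGRIND_H 1
#  endif
#endif
#ifndef HAVE_HELGRIND_H
#  define ANNOTATE_HAPPENS_BEFORE(obj) ((void)(obj))
#  define ANNOTATE_HAPPENS_AFTER(obj)  ((void)(obj))
#endif

static int payload;        /* plain data handed from producer to consumer */
static atomic_int ready;   /* custom (non-POSIX) "data available" flag */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;
    /* Everything above is declared to happen before the matching
       ANNOTATE_HAPPENS_AFTER on the same object address. */
    ANNOTATE_HAPPENS_BEFORE(&ready);
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    int *out = arg;
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        sched_yield();     /* wait politely for the flag */
    ANNOTATE_HAPPENS_AFTER(&ready);
    *out = payload;        /* ordered after the producer's write */
    return NULL;
}

/* Runs one handoff and returns the value the consumer saw,
   or -1 if a thread could not be created. */
int run_handoff(void)
{
    pthread_t prod, cons;
    int seen = -1;

    payload = 0;
    atomic_store(&ready, 0);

    if (pthread_create(&cons, NULL, consumer, &seen) != 0)
        return -1;
    if (pthread_create(&prod, NULL, producer, NULL) != 0) {
        atomic_store(&ready, 1);   /* release the waiting consumer */
        pthread_join(cons, NULL);
        return -1;
    }
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return seen;
}

#ifdef DEMO_MAIN   /* hypothetical flag: build with -DDEMO_MAIN to run standalone */
int main(void)
{
    printf("consumer saw %d\n", run_handoff());
    return 0;
}
#endif
]]></programlisting>
<para>The annotation pair uses the address of <varname>ready</varname>
purely as a tag that links the two sides. Whether Helgrind would report
the accesses to <varname>payload</varname> without the annotations depends
on how it interprets the atomic operations on your platform, so treat this
as a pattern for describing custom synchronisation rather than as a
guaranteed reproduction of a false positive.</para>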
				1452
				1453	</sect1>
				1454
				1455
				1456
sewardj	572feb7	2007-11-09 23:59:49 +0000	[diff] [blame]	1457	<sect1 id="hg-manual.todolist" xreflabel="To Do List">
				1458	<title>A To-Do List for Helgrind</title>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1459
				1460	<para>The following is a list of loose ends which should be tidied up
				1461	some time.</para>
				1462
				1463	<itemizedlist>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1464	<listitem><para>For lock order errors, print the complete lock
				1465	cycle, rather than only doing so for size-2 cycles as at
				1466	present.</para>
				1467	</listitem>
sewardj	c6a1cd1	2008-12-22 00:39:41 +0000	[diff] [blame]	1468	<listitem><para>The conflicting access mechanism sometimes
				1469	mysteriously fails to show the conflicting access' stack, even
				1470	when provided with unbounded storage for conflicting access info.
				1471	This should be investigated.</para>
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1472	</listitem>
njn	7316df2	2009-08-04 01:16:01 +0000	[diff] [blame]	1473	<listitem><para>Document races caused by GCC's thread-unsafe code
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1474	generation for speculative stores. In the interim see
				1475	<computeroutput>http://gcc.gnu.org/ml/gcc/2007-10/msg00266.html
				1476	</computeroutput>
				1477	and <computeroutput>http://lkml.org/lkml/2007/10/24/673</computeroutput>.
				1478	</para>
				1479	</listitem>
				1480	<listitem><para>Don't update the lock-order graph, and don't check
njn	f6e8ca9	2009-08-07 02:18:00 +0000	[diff] [blame]	1481	for errors, when a "try"-style lock operation happens (e.g.
				1482	<function>pthread_mutex_trylock</function>). Such calls do not add any real
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1483	restrictions to the locking order, since they can always fail to
				1484	acquire the lock, resulting in the caller going off and doing Plan
				1485	B (presumably it will have a Plan B). Doing such checks could
				1486	generate false lock-order errors and confuse users.</para>
				1487	</listitem>
				1488	<listitem><para> Performance can be very poor. Slowdowns on the
sewardj	c6a1cd1	2008-12-22 00:39:41 +0000	[diff] [blame]	1489	order of 100:1 are not unusual. There is limited scope for
				1490	performance improvements.
sewardj	b411202	2007-11-09 22:49:28 +0000	[diff] [blame]	1491	</para>
				1492	</listitem>
				1493
				1494	</itemizedlist>
				1495
				1496	</sect1>
				1497
				1498	</chapter>