<?xml version="1.0"?> <!-- -*- sgml -*- -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"
[ <!ENTITY % vg-entities SYSTEM "../../docs/xml/vg-entities.xml"> %vg-entities; ]>


<chapter id="hg-manual" xreflabel="Helgrind: thread error detector">
<title>Helgrind: a thread error detector</title>

<para>To use this tool, you must specify
<option>--tool=helgrind</option> on the Valgrind
command line.</para>


<sect1 id="hg-manual.overview" xreflabel="Overview">
<title>Overview</title>

<para>Helgrind is a Valgrind tool for detecting synchronisation errors
in C, C++ and Fortran programs that use the POSIX pthreads
threading primitives.</para>

<para>The main abstractions in POSIX pthreads are: a set of threads
sharing a common address space, thread creation, thread joining,
thread exit, mutexes (locks), condition variables (inter-thread event
notifications), reader-writer locks, spinlocks, semaphores and
barriers.</para>

<para>Helgrind can detect three classes of errors, which are discussed
in detail in the next three sections:</para>

<orderedlist>
 <listitem>
  <para><link linkend="hg-manual.api-checks">
        Misuses of the POSIX pthreads API.</link></para>
 </listitem>
 <listitem>
  <para><link linkend="hg-manual.lock-orders">
        Potential deadlocks arising from lock
        ordering problems.</link></para>
 </listitem>
 <listitem>
  <para><link linkend="hg-manual.data-races">
        Data races -- accessing memory without adequate locking
        or synchronisation</link>.
  </para>
 </listitem>
</orderedlist>

<para>Problems like these often result in unreproducible,
timing-dependent crashes, deadlocks and other misbehaviour, and
can be difficult to find by other means.</para>

<para>Helgrind is aware of all the pthread abstractions and tracks
their effects as accurately as it can.  On x86 and amd64 platforms, it
understands and partially handles implicit locking arising from the
use of the LOCK instruction prefix.  On PowerPC/POWER and ARM
platforms, it partially handles implicit locking arising from
load-linked and store-conditional instruction pairs.
</para>

<para>Helgrind works best when your application uses only the POSIX
pthreads API.  However, if you want to use custom threading
primitives, you can describe their behaviour to Helgrind using the
<varname>ANNOTATE_*</varname> macros defined
in <varname>helgrind.h</varname>.</para>



<para>Following those three sections is a section containing
<link linkend="hg-manual.effective-use">
hints and tips on how to get the best out of Helgrind.</link>
</para>

<para>Then there is a
<link linkend="hg-manual.options">summary of command-line
options.</link>
</para>

<para>Finally, there is
<link linkend="hg-manual.todolist">a brief summary of areas in which Helgrind
could be improved.</link>
</para>

</sect1>




<sect1 id="hg-manual.api-checks" xreflabel="API Checks">
<title>Detected errors: Misuses of the POSIX pthreads API</title>

<para>Helgrind intercepts calls to many POSIX pthreads functions, and
is therefore able to report on various common problems.  Although
these are unglamorous errors, their presence can lead to undefined
program behaviour and hard-to-find bugs later on.  The detected errors
are:</para>

<itemizedlist>
  <listitem><para>unlocking an invalid mutex</para></listitem>
  <listitem><para>unlocking a not-locked mutex</para></listitem>
  <listitem><para>unlocking a mutex held by a different
                  thread</para></listitem>
  <listitem><para>destroying an invalid or a locked mutex</para></listitem>
  <listitem><para>recursively locking a non-recursive mutex</para></listitem>
  <listitem><para>deallocation of memory that contains a
                  locked mutex</para></listitem>
  <listitem><para>passing mutex arguments to functions expecting
                  reader-writer lock arguments, and vice
                  versa</para></listitem>
  <listitem><para>when a POSIX pthread function fails with an
                  error code that must be handled</para></listitem>
  <listitem><para>when a thread exits whilst still holding locked
                  locks</para></listitem>
  <listitem><para>calling <function>pthread_cond_wait</function>
                  with a not-locked mutex, an invalid mutex,
                  or one locked by a different
                  thread</para></listitem>
  <listitem><para>inconsistent bindings between condition
                  variables and their associated mutexes</para></listitem>
  <listitem><para>invalid or duplicate initialisation of a pthread
                  barrier</para></listitem>
  <listitem><para>initialisation of a pthread barrier on which threads
                  are still waiting</para></listitem>
  <listitem><para>destruction of a pthread barrier object which was
                  never initialised, or on which threads are still
                  waiting</para></listitem>
  <listitem><para>waiting on an uninitialised pthread
                  barrier</para></listitem>
  <listitem><para>for all of the pthreads functions that Helgrind
                  intercepts, an error is reported, along with a stack
                  trace, if the system threading library routine returns
                  an error code, even if Helgrind itself detected no
                  error</para></listitem>
</itemizedlist>

<para>Checks pertaining to the validity of mutexes are generally also
performed for reader-writer locks.</para>

<para>Various kinds of this-can't-possibly-happen events are also
reported.  These usually indicate bugs in the system threading
library.</para>

<para>Reported errors always contain a primary stack trace indicating
where the error was detected.  They may also contain auxiliary stack
traces giving additional information.  In particular, most errors
relating to mutexes will also tell you where that mutex first came to
Helgrind's attention (the "<computeroutput>was first observed
at</computeroutput>" part), so you have a chance of figuring out which
mutex it is referring to.  For example:</para>

<programlisting><![CDATA[
Thread #1 unlocked a not-locked lock at 0x7FEFFFA90
   at 0x4C2408D: pthread_mutex_unlock (hg_intercepts.c:492)
   by 0x40073A: nearly_main (tc09_bad_unlock.c:27)
   by 0x40079B: main (tc09_bad_unlock.c:50)
  Lock at 0x7FEFFFA90 was first observed
   at 0x4C25D01: pthread_mutex_init (hg_intercepts.c:326)
   by 0x40071F: nearly_main (tc09_bad_unlock.c:23)
   by 0x40079B: main (tc09_bad_unlock.c:50)
]]></programlisting>

<para>Helgrind has a way of summarising thread identities, as
you see here with the text "<computeroutput>Thread
#1</computeroutput>".  This is so that it can speak about threads and
sets of threads without overwhelming you with details.  See
<link linkend="hg-manual.data-races.errmsgs">below</link>
for more information on interpreting error messages.</para>
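
<para>For illustration, the "unlocked a not-locked lock" report in the
example above is the kind of error that code along the following lines
would provoke.  This is only a sketch and is not the source of
<computeroutput>tc09_bad_unlock.c</computeroutput>: the function
name <function>unlock_without_lock</function> is hypothetical, and an
error-checking mutex is assumed so that the misuse returns an error code
rather than invoking undefined behaviour.</para>

<programlisting><![CDATA[
```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: unlocks a mutex that was never locked and returns
   the error code reported by pthread_mutex_unlock. */
int unlock_without_lock(void) {
    pthread_mutexattr_t attr;
    pthread_mutex_t mx;
    int r;

    /* Assumption: an error-checking mutex, so the misuse is reported as
       an error code instead of being undefined behaviour. */
    if (pthread_mutexattr_init(&attr) != 0) return -1;
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    r = pthread_mutex_init(&mx, &attr);
    pthread_mutexattr_destroy(&attr);
    if (r != 0) return -1;

    r = pthread_mutex_unlock(&mx);   /* bug: mx is not locked */

    pthread_mutex_destroy(&mx);
    return r;
}

/* Possible use from main():  assert(unlock_without_lock() == EPERM); */
```
]]></programlisting>

<para>Because the library call itself fails here, this sketch also falls
under the last item in the list above: an intercepted pthreads function
returning an error code.</para>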

</sect1>




<sect1 id="hg-manual.lock-orders" xreflabel="Lock Orders">
<title>Detected errors: Inconsistent Lock Orderings</title>

<para>In this section, and in general, to "acquire" a lock simply
means to lock that lock, and to "release" a lock means to unlock
it.</para>

<para>Helgrind monitors the order in which threads acquire locks.
This allows it to detect potential deadlocks which could arise from
the formation of cycles of locks.  Detecting such inconsistencies is
useful because, whilst actual deadlocks are fairly obvious, potential
deadlocks may never be discovered during testing and could later lead
to hard-to-diagnose in-service failures.</para>

<para>The simplest example of such a problem is as
follows.</para>

<itemizedlist>
 <listitem><para>Imagine some shared resource R, which, for whatever
  reason, is guarded by two locks, L1 and L2, which must both be held
  when R is accessed.</para>
 </listitem>
 <listitem><para>Suppose a thread acquires L1, then L2, and proceeds
  to access R.  The implication of this is that all threads in the
  program must acquire the two locks in the order first L1 then L2.
  Not doing so risks deadlock.</para>
 </listitem>
 <listitem><para>The deadlock could happen if two threads -- call them
  T1 and T2 -- both want to access R.  Suppose T1 acquires L1 first,
  and T2 acquires L2 first.  Then T1 tries to acquire L2, and T2 tries
  to acquire L1, but those locks are both already held.  So T1 and T2
  become deadlocked.</para>
 </listitem>
</itemizedlist>

<para>Helgrind builds a directed graph indicating the order in which
locks have been acquired in the past.  When a thread acquires a new
lock, the graph is updated, and then checked to see if it now contains
a cycle.  The presence of a cycle indicates a potential deadlock involving
the locks in the cycle.</para>

<para>In general, Helgrind will choose two locks involved in the cycle
and show you how their acquisition ordering has become inconsistent.
It does this by showing the program points that first defined the
ordering, and the program points which later violated it.  Here is a
simple example involving just two locks:</para>

<programlisting><![CDATA[
Thread #1: lock order "0x7FF0006D0 before 0x7FF0006A0" violated

Observed (incorrect) order is: acquisition of lock at 0x7FF0006A0
   at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
   by 0x400825: main (tc13_laog1.c:23)

 followed by a later acquisition of lock at 0x7FF0006D0
   at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
   by 0x400853: main (tc13_laog1.c:24)

Required order was established by acquisition of lock at 0x7FF0006D0
   at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
   by 0x40076D: main (tc13_laog1.c:17)

 followed by a later acquisition of lock at 0x7FF0006A0
   at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
   by 0x40079B: main (tc13_laog1.c:18)
]]></programlisting>
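
<para>A report of this shape can arise from code such as the following
sketch, in which a single thread takes two locks first in one order and
then in the other.  It is not the source of
<computeroutput>tc13_laog1.c</computeroutput>; the names
<varname>mx1</varname>, <varname>mx2</varname>,
<function>lock_pair</function> and
<function>inconsistent_lock_order</function> are hypothetical.  Because
only one thread is involved, the program cannot actually deadlock, yet
the inconsistent ordering is still present.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t mx1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t mx2 = PTHREAD_MUTEX_INITIALIZER;

/* Acquire both locks in the order "first, then second", then release
   them in the reverse order. */
static int lock_pair(pthread_mutex_t *first, pthread_mutex_t *second) {
    int r = pthread_mutex_lock(first);
    if (r != 0) return r;
    r = pthread_mutex_lock(second);
    if (r != 0) {
        pthread_mutex_unlock(first);
        return r;
    }
    pthread_mutex_unlock(second);
    pthread_mutex_unlock(first);
    return 0;
}

/* Hypothetical driver: runs entirely in the calling thread. */
int inconsistent_lock_order(void) {
    int r = lock_pair(&mx1, &mx2);   /* establishes "mx1 before mx2" */
    if (r != 0) return r;
    return lock_pair(&mx2, &mx1);    /* later violates that order */
}

/* Possible use from main():  assert(inconsistent_lock_order() == 0); */
```
]]></programlisting>

<para>If two different threads ran the two calls
to <function>lock_pair</function> concurrently, each could end up
holding one lock whilst waiting for the other.</para>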

<para>When there are more than two locks in the cycle, the error is
equally serious.  However, at present Helgrind does not show the locks
involved, sometimes because that information is not available, but
also so as to avoid flooding you with information.  For example, a
naive implementation of the famous Dining Philosophers problem
involves a cycle of five locks
(see <computeroutput>helgrind/tests/tc14_laog_dinphils.c</computeroutput>).
In this case Helgrind has detected that all five philosophers could
simultaneously pick up their left fork and then deadlock whilst
waiting to pick up their right forks.</para>

<programlisting><![CDATA[
Thread #6: lock order "0x80499A0 before 0x8049A00" violated

Observed (incorrect) order is: acquisition of lock at 0x8049A00
   at 0x40085BC: pthread_mutex_lock (hg_intercepts.c:495)
   by 0x80485B4: dine (tc14_laog_dinphils.c:18)
   by 0x400BDA4: mythread_wrapper (hg_intercepts.c:219)
   by 0x39B924: start_thread (pthread_create.c:297)
   by 0x2F107D: clone (clone.S:130)

 followed by a later acquisition of lock at 0x80499A0
   at 0x40085BC: pthread_mutex_lock (hg_intercepts.c:495)
   by 0x80485CD: dine (tc14_laog_dinphils.c:19)
   by 0x400BDA4: mythread_wrapper (hg_intercepts.c:219)
   by 0x39B924: start_thread (pthread_create.c:297)
   by 0x2F107D: clone (clone.S:130)
]]></programlisting>

</sect1>




<sect1 id="hg-manual.data-races" xreflabel="Data Races">
<title>Detected errors: Data Races</title>

<para>A data race happens, or could happen, when two threads access a
shared memory location without using suitable locks or other
synchronisation to ensure single-threaded access.  Such missing
locking can cause obscure timing-dependent bugs.  Ensuring programs
are race-free is one of the central difficulties of threaded
programming.</para>

<para>Reliably detecting races is a difficult problem, and most
of Helgrind's internals are devoted to dealing with it.
We begin with a simple example.</para>


<sect2 id="hg-manual.data-races.example" xreflabel="Simple Race">
<title>A Simple Data Race</title>

<para>About the simplest possible example of a race is as follows.  In
this program, it is impossible to know what the value
of <computeroutput>var</computeroutput> is at the end of the program.
Is it 2?  Or 1?</para>

<programlisting><![CDATA[
#include <pthread.h>

int var = 0;

void* child_fn ( void* arg ) {
   var++; /* Unprotected relative to parent */ /* this is line 6 */
   return NULL;
}

int main ( void ) {
   pthread_t child;
   pthread_create(&child, NULL, child_fn, NULL);
   var++; /* Unprotected relative to child */ /* this is line 13 */
   pthread_join(child, NULL);
   return 0;
}
]]></programlisting>

<para>The problem is that nothing
stops <varname>var</varname> being updated simultaneously
by both threads.  A correct program would
protect <varname>var</varname> with a lock of type
<function>pthread_mutex_t</function>, which is acquired
before each access and released afterwards.  Helgrind's output for
this program is:</para>

<programlisting><![CDATA[
Thread #1 is the program's root thread

Thread #2 was created
   at 0x511C08E: clone (in /lib64/libc-2.8.so)
   by 0x4E333A4: do_clone (in /lib64/libpthread-2.8.so)
   by 0x4E33A30: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.8.so)
   by 0x4C299D4: pthread_create@* (hg_intercepts.c:214)
   by 0x400605: main (simple_race.c:12)

Possible data race during read of size 4 at 0x601038 by thread #1
Locks held: none
   at 0x400606: main (simple_race.c:13)

This conflicts with a previous write of size 4 by thread #2
Locks held: none
   at 0x4005DC: child_fn (simple_race.c:6)
   by 0x4C29AFF: mythread_wrapper (hg_intercepts.c:194)
   by 0x4E3403F: start_thread (in /lib64/libpthread-2.8.so)
   by 0x511C0CC: clone (in /lib64/libc-2.8.so)

Location 0x601038 is 0 bytes inside global var "var"
declared at simple_race.c:3
]]></programlisting>

<para>This is quite a lot of detail for an apparently simple error.
The main error message is the clause beginning
"<computeroutput>Possible data race</computeroutput>".  It says there is a race as
a result of a read of size 4 (bytes), at 0x601038, which is the
address of <computeroutput>var</computeroutput>, happening in
function <computeroutput>main</computeroutput> at line 13 in the
program.</para>

<para>Two important parts of the message are:</para>

<itemizedlist>
 <listitem>
  <para>Helgrind shows two stack traces for the error, not one.  By
   definition, a race involves two different threads accessing the
   same location in such a way that the result depends on the relative
   speeds of the two threads.</para>
  <para>
   The first stack trace follows the text "<computeroutput>Possible
   data race during read of size 4 ...</computeroutput>" and the
   second trace follows the text "<computeroutput>This conflicts with
   a previous write of size 4 ...</computeroutput>".  Helgrind is
   usually able to show both accesses involved in a race.  At least
   one of these will be a write (since two concurrent, unsynchronised
   reads are harmless), and they will of course be from different
   threads.</para>
  <para>By examining your program at the two locations, you should be
   able to get at least some idea of what the root cause of the
   problem is.  For each location, Helgrind shows the set of locks
   held at the time of the access.  This often makes it clear which
   thread, if any, failed to take a required lock.  In this example
   neither thread holds a lock during the access.</para>
 </listitem>
 <listitem>
  <para>For races which occur on global or stack variables, Helgrind
   tries to identify the name and defining point of the variable.
   Hence the text "<computeroutput>Location 0x601038 is 0 bytes inside
   global var "var" declared at simple_race.c:3</computeroutput>".</para>
  <para>Showing names of stack and global variables carries no
   run-time overhead once Helgrind has your program up and running.
   However, it does require Helgrind to spend considerable extra time
   and memory at program startup to read the relevant debug info.
   Hence this facility is disabled by default.  To enable it, you need
   to give the <varname>--read-var-info=yes</varname> option to
   Helgrind.</para>
 </listitem>
</itemizedlist>
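
<para>For comparison, here is a minimal sketch of how the example
program might be corrected by protecting <varname>var</varname> with a
<function>pthread_mutex_t</function>, as described earlier.  The names
<varname>var_mx</varname> and <function>run_protected</function> are
hypothetical, and the body of <function>main</function> has been moved
into a function that returns the final value
of <varname>var</varname>.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static int var = 0;
static pthread_mutex_t var_mx = PTHREAD_MUTEX_INITIALIZER;  /* guards var */

static void *child_fn(void *arg) {
    (void)arg;
    pthread_mutex_lock(&var_mx);
    var++;                          /* protected relative to parent */
    pthread_mutex_unlock(&var_mx);
    return NULL;
}

/* Hypothetical driver: returns the final value of var, or -1 if the
   child thread could not be created. */
int run_protected(void) {
    pthread_t child;

    var = 0;                        /* no other thread exists yet */
    if (pthread_create(&child, NULL, child_fn, NULL) != 0) return -1;

    pthread_mutex_lock(&var_mx);
    var++;                          /* protected relative to child */
    pthread_mutex_unlock(&var_mx);

    pthread_join(child, NULL);
    return var;                     /* child has exited and been joined */
}

/* Possible use from main():  assert(run_protected() == 2); */
```
]]></programlisting>

<para>With both increments made whilst holding the same lock, the final
value is always 2, whichever thread gets there first.</para>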

<para>The following section explains Helgrind's race detection
algorithm in more detail.</para>

</sect2>



<sect2 id="hg-manual.data-races.algorithm" xreflabel="DR Algorithm">
<title>Helgrind's Race Detection Algorithm</title>

<para>Most programmers think about threaded programming in terms of
the basic functionality provided by the threading library (POSIX
pthreads): thread creation, thread joining, locks, condition
variables, semaphores and barriers.</para>

<para>The effect of using these functions is to impose
constraints upon the order in which memory accesses can
happen.  This implied ordering is generally known as the
"happens-before relation".  Once you understand the happens-before
relation, it is easy to see how Helgrind finds races in your code.
Fortunately, the happens-before relation is itself easy to understand,
and is by itself a useful tool for reasoning about the behaviour of
parallel programs.  We now introduce it using a simple example.</para>

<para>Consider first the following buggy program:</para>

<programlisting><![CDATA[
Parent thread:                         Child thread:

int var;

// create child thread
pthread_create(...)
var = 20;                              var = 10;
                                       exit

// wait for child
pthread_join(...)
printf("%d\n", var);
]]></programlisting>

<para>The parent thread creates a child.  Both then write different
values to some variable <computeroutput>var</computeroutput>, and the
parent then waits for the child to exit.</para>

<para>What is the value of <computeroutput>var</computeroutput> at the
end of the program, 10 or 20?  We don't know.  The program is
considered buggy (it has a race) because the final value
of <computeroutput>var</computeroutput> depends on the relative rates
of progress of the parent and child threads.  If the parent is fast
and the child is slow, then the child's assignment may happen later,
so the final value will be 10; and vice versa if the child is faster
than the parent.</para>

<para>The relative rates of progress of parent and child are not something
the programmer can control, and will often change from run to run.
They depend on the load on the machine, what else is
running, the kernel's scheduling strategy, and many other factors.</para>

<para>The obvious fix is to use a lock to
protect <computeroutput>var</computeroutput>.  It is however
instructive to consider a somewhat more abstract solution, which is to
send a message from one thread to the other:</para>

<programlisting><![CDATA[
Parent thread:                         Child thread:

int var;

// create child thread
pthread_create(...)
var = 20;
// send message to child
                                       // wait for message to arrive
                                       var = 10;
                                       exit

// wait for child
pthread_join(...)
printf("%d\n", var);
]]></programlisting>
| 477 | |
sewardj | 7c76839 | 2008-12-21 21:17:24 +0000 | [diff] [blame] | 478 | <para>Now the program reliably prints "10", regardless of the speed of |
| 479 | the threads. Why? Because the child's assignment cannot happen until |
| 480 | after it receives the message. And the message is not sent until |
| 481 | after the parent's assignment is done.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 482 | |
sewardj | 7c76839 | 2008-12-21 21:17:24 +0000 | [diff] [blame] | 483 | <para>The message transmission creates a "happens-before" dependency |
| 484 | between the two assignments: <computeroutput>var = 20;</computeroutput> |
| 485 | must now happen-before <computeroutput>var = 10;</computeroutput>. |
| 486 | And so there is no longer a race |
| 487 | on <computeroutput>var</computeroutput>. |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 488 | </para> |
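<para>As a concrete sketch of the message-passing scheme above, the
"message" can be a POSIX semaphore with an initial value of zero: the
parent posts it after its assignment and the child waits on it before
its own. This is only one possible realisation. The names
<function>run_message_example</function>, <function>child</function>
and <computeroutput>msg</computeroutput> are illustrative, and unnamed
semaphores created with <function>sem_init</function> are assumed to
be available, as they are on Linux.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

static int   var;   /* the shared location from the example */
static sem_t msg;   /* stands in for the "message" (illustrative name) */

static void *child(void *arg)
{
    (void)arg;
    sem_wait(&msg);   /* wait for message to arrive */
    var = 10;         /* now ordered after the parent's var = 20 */
    return NULL;
}

/* Runs the parent side of the example and returns the final value of var. */
int run_message_example(void)
{
    pthread_t tid;
    int rc;

    rc = sem_init(&msg, 0, 0);   /* initial value 0: nothing sent yet */
    assert(rc == 0);

    rc = pthread_create(&tid, NULL, child, NULL);   /* create child thread */
    assert(rc == 0);

    var = 20;
    sem_post(&msg);              /* send message to child */

    rc = pthread_join(tid, NULL);   /* wait for child */
    assert(rc == 0);

    sem_destroy(&msg);
    return var;
}
```
]]></programlisting>

<para>Calling <function>run_message_example</function> from
<function>main</function> and printing the result should give 10 on
every run, whichever thread the scheduler happens to favour.</para>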
| 489 | |
sewardj | 7c76839 | 2008-12-21 21:17:24 +0000 | [diff] [blame] | 490 | <para>Note that it's not significant that the parent sends a message |
| 491 | to the child. Sending a message from the child (after its assignment) |
| 492 | to the parent (before its assignment) would also fix the problem, causing |
| 493 | the program to reliably print "20".</para> |
| 494 | |
| 495 | <para>Helgrind's algorithm is (conceptually) very simple. It monitors all |
| 496 | accesses to memory locations. If a location (in this example, |
| 497 | <computeroutput>var</computeroutput>) |
| 498 | is accessed by two different threads, Helgrind checks to see if the |
sewardj | 1a620d5 | 2008-12-23 11:13:07 +0000 | [diff] [blame] | 499 | two accesses are ordered by the happens-before relation. If so, |
sewardj | 7c76839 | 2008-12-21 21:17:24 +0000 | [diff] [blame] | 500 | that's fine; if not, it reports a race.</para> |
| 501 | |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 502 | <para>It is important to understand that the happens-before relation |
sewardj | 7c76839 | 2008-12-21 21:17:24 +0000 | [diff] [blame] | 503 | creates only a partial ordering, not a total ordering. An example of |
| 504 | a total ordering is comparison of numbers: for any two numbers |
| 505 | <computeroutput>x</computeroutput> and |
| 506 | <computeroutput>y</computeroutput>, either |
| 507 | <computeroutput>x</computeroutput> is less than, equal to, or greater |
| 508 | than |
| 509 | <computeroutput>y</computeroutput>. A partial ordering is like a |
sewardj | c6a1cd1 | 2008-12-22 00:39:41 +0000 | [diff] [blame] | 510 | total ordering, but it can also express the concept that two elements |
sewardj | 7c76839 | 2008-12-21 21:17:24 +0000 | [diff] [blame] | 511 | are neither equal, less nor greater, but merely unordered with respect |
| 512 | to each other.</para> |
| 513 | |
| 514 | <para>In the fixed example above, we say that |
| 515 | <computeroutput>var = 20;</computeroutput> "happens-before" |
| 516 | <computeroutput>var = 10;</computeroutput>. But in the original |
| 517 | version, they are unordered: we cannot say that either happens-before |
| 518 | the other.</para> |
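<para>One textbook way to make such a partial order concrete is to tag
each event with a vector timestamp holding one counter per thread, and
to compare timestamps component by component. The sketch below is an
illustration only: nothing in this manual states that Helgrind stores
happens-before information in this form, and the names
<function>vc_compare</function>, <function>vc_demo</function> and
<computeroutput>NTHREADS</computeroutput>, as well as the timestamp
values, are invented for the example.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

#define NTHREADS 2   /* index 0: parent, index 1: child (illustrative) */

typedef enum { VC_EQUAL, VC_BEFORE, VC_AFTER, VC_UNORDERED } vc_order;

/* Compare two vector timestamps component-wise.  Unlike comparing two
   numbers, the answer may be "unordered". */
vc_order vc_compare(const unsigned a[NTHREADS], const unsigned b[NTHREADS])
{
    int a_le_b = 1, b_le_a = 1;

    for (size_t i = 0; i < NTHREADS; i++) {
        if (a[i] > b[i]) a_le_b = 0;
        if (b[i] > a[i]) b_le_a = 0;
    }
    if (a_le_b && b_le_a) return VC_EQUAL;
    if (a_le_b)           return VC_BEFORE;     /* a happens-before b */
    if (b_le_a)           return VC_AFTER;      /* b happens-before a */
    return VC_UNORDERED;                        /* neither: a potential race */
}

/* Hypothetical timestamps for the two assignments in the example. */
void vc_demo(void)
{
    const unsigned parent_write[NTHREADS] = {2, 0};  /* var = 20 */
    const unsigned child_fixed[NTHREADS]  = {2, 1};  /* var = 10, after the message */
    const unsigned child_racy[NTHREADS]   = {1, 1};  /* var = 10, no message */

    assert(vc_compare(parent_write, child_fixed) == VC_BEFORE);
    assert(vc_compare(parent_write, child_racy)  == VC_UNORDERED);
}
```
]]></programlisting>

<para>In the fixed version the child's timestamp has absorbed the
parent's, so the two writes compare as ordered. In the original
version neither timestamp dominates the other, which is exactly the
"unordered" outcome that a total ordering cannot express.</para>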
| 519 | |
| 520 | <para>What does it mean to say that two accesses from different |
sewardj | 1a620d5 | 2008-12-23 11:13:07 +0000 | [diff] [blame] | 521 | threads are ordered by the happens-before relation? It means that |
sewardj | 7c76839 | 2008-12-21 21:17:24 +0000 | [diff] [blame] | 522 | there is some chain of inter-thread synchronisation operations which |
| 523 | causes those accesses to happen in a particular order, irrespective of |
| 524 | the actual rates of progress of the individual threads. This is a |
| 525 | required property for a reliable threaded program, which is why |
| 526 | Helgrind checks for it.</para> |
| 527 | |
| 528 | <para>The happens-before relations created by standard threading |
| 529 | primitives are as follows:</para> |
| 530 | |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 531 | <itemizedlist> |
sewardj | 7c76839 | 2008-12-21 21:17:24 +0000 | [diff] [blame] | 532 | <listitem><para>When a mutex is unlocked by thread T1 and later (or |
| 533 | immediately) locked by thread T2, then the memory accesses in T1 |
| 534 | prior to the unlock must happen-before those in T2 after it acquires |
| 535 | the lock.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 536 | </listitem> |
sewardj | 7c76839 | 2008-12-21 21:17:24 +0000 | [diff] [blame] | 537 | <listitem><para>The same idea applies to reader-writer locks, |
| 538 | although with some complication so as to allow correct handling of |
| 539 | reads vs writes.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 540 | </listitem> |
sewardj | c6a1cd1 | 2008-12-22 00:39:41 +0000 | [diff] [blame] | 541 | <listitem><para>When a condition variable (CV) is signalled on by |
| 542 | thread T1 and some other thread T2 is thereby released from a wait |
| 543 | on the same CV, then the memory accesses in T1 prior to the |
| 544 | signalling must happen-before those in T2 after it returns from the |
| 545 | wait. If no thread was waiting on the CV then there is no |
sewardj | 7c76839 | 2008-12-21 21:17:24 +0000 | [diff] [blame] | 546 | effect.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 547 | </listitem> |
sewardj | c6a1cd1 | 2008-12-22 00:39:41 +0000 | [diff] [blame] | 548 | <listitem><para>If instead T1 broadcasts on a CV, then all of the |
sewardj | 7c76839 | 2008-12-21 21:17:24 +0000 | [diff] [blame] | 549 | waiting threads, rather than just one of them, acquire a |
| 550 | happens-before dependency on the broadcasting thread at the point it |
| 551 | did the broadcast.</para> |
| 552 | </listitem> |
| 553 | <listitem><para>A thread T2 that continues after completing sem_wait |
| 554 | on a semaphore that thread T1 posts on, acquires a happens-before |
| 555 | dependence on the posting thread, a bit like dependencies caused by |
| 556 | mutex unlock-lock pairs. However, since a semaphore can be posted |
| 557 | on many times, it is unspecified from which of the post calls the |
| 558 | wait call gets its happens-before dependency.</para> |
| 559 | </listitem> |
| 560 | <listitem><para>For a group of threads T1 .. Tn which arrive at a |
| 561 | barrier and then move on, each thread after the call has a |
| 562 | happens-after dependency from all threads before the |
| 563 | barrier.</para> |
| 564 | </listitem> |
| 565 | <listitem><para>A newly-created child thread acquires an initial |
| 566 | happens-after dependency on the point where its parent created it. |
| 567 | That is, all memory accesses performed by the parent prior to |
| 568 | creating the child are regarded as happening-before all the accesses |
| 569 | of the child.</para> |
| 570 | </listitem> |
| 571 | <listitem><para>Similarly, when an exiting thread is reaped via a |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 572 | call to <function>pthread_join</function>, once the call returns, the |
| 573 | reaping thread acquires a happens-after dependency relative to all memory |
| 574 | accesses made by the exiting thread.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 575 | </listitem> |
| 576 | </itemizedlist> |
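<para>The mutex, thread-creation and thread-join rules in the list above
can be seen working together in the following sketch. It is an
assumed, minimal example rather than code from the Helgrind test
suite, and the names <function>run_counter_example</function>,
<function>worker</function> and
<computeroutput>ITERATIONS</computeroutput> are illustrative.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define ITERATIONS 100000

static long counter;
static pthread_mutex_t mx = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&mx);
        counter++;                  /* ordered by unlock -> lock pairs */
        pthread_mutex_unlock(&mx);
    }
    return NULL;
}

/* Starts two workers and returns the final counter value. */
long run_counter_example(void)
{
    pthread_t t1, t2;
    int rc;

    counter = 0;   /* happens-before both workers: done before pthread_create */

    rc = pthread_create(&t1, NULL, worker, NULL);
    assert(rc == 0);
    rc = pthread_create(&t2, NULL, worker, NULL);
    assert(rc == 0);

    rc = pthread_join(t1, NULL);
    assert(rc == 0);
    rc = pthread_join(t2, NULL);
    assert(rc == 0);

    /* No lock needed here: both joins have returned, so every worker
       access happens-before this read. */
    return counter;
}
```
]]></programlisting>

<para>Every access to <computeroutput>counter</computeroutput> is
ordered with respect to every other, either by an unlock-lock pair on
<computeroutput>mx</computeroutput> or by thread creation and joining,
so the function should always return twice
<computeroutput>ITERATIONS</computeroutput>.</para>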
| 577 | |
sewardj | c6a1cd1 | 2008-12-22 00:39:41 +0000 | [diff] [blame] | 578 | <para>In summary: Helgrind intercepts the above listed events, and builds a |
sewardj | 7c76839 | 2008-12-21 21:17:24 +0000 | [diff] [blame] | 579 | directed acyclic graph representing the collective happens-before |
| 580 | dependencies. It also monitors all memory accesses.</para> |
| 581 | |
| 582 | <para>If a location is accessed by two different threads, but Helgrind |
| 583 | cannot find any path through the happens-before graph from one access |
sewardj | c6a1cd1 | 2008-12-22 00:39:41 +0000 | [diff] [blame] | 584 | to the other, then it reports a race.</para> |
sewardj | 7c76839 | 2008-12-21 21:17:24 +0000 | [diff] [blame] | 585 | |
| 586 | <para>There are a couple of caveats:</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 587 | |
| 588 | <itemizedlist> |
sewardj | c6a1cd1 | 2008-12-22 00:39:41 +0000 | [diff] [blame] | 589 | <listitem><para>Helgrind doesn't check for a race in the case where |
| 590 | both accesses are reads. That would be silly, since concurrent |
| 591 | reads are harmless.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 592 | </listitem> |
sewardj | 7c76839 | 2008-12-21 21:17:24 +0000 | [diff] [blame] | 593 | <listitem><para>Two accesses are considered to be ordered by the |
| 594 | happens-before dependency even through arbitrarily long chains of |
| 595 | synchronisation events. For example, if T1 accesses some location |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 596 | L, and then calls <function>pthread_cond_signal</function> to wake T2, which later |
| 597 | calls <function>pthread_cond_signal</function> to wake T3, which then accesses L, then |
| 598 | a suitable happens-before dependency exists between the first and second |
sewardj | 7c76839 | 2008-12-21 21:17:24 +0000 | [diff] [blame] | 599 | accesses, even though it involves two different inter-thread |
| 600 | synchronisation events.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 601 | </listitem> |
| 602 | </itemizedlist> |
| 603 | |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 604 | </sect2> |
| 605 | |
| 606 | |
| 607 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 608 | <sect2 id="hg-manual.data-races.errmsgs" xreflabel="Race Error Messages"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 609 | <title>Interpreting Race Error Messages</title> |
| 610 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 611 | <para>Helgrind's race detection algorithm collects a lot of |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 612 | information, and tries to present it in a helpful way when a race is |
| 613 | detected. Here's an example:</para> |
| 614 | |
| 615 | <programlisting><![CDATA[ |
| 616 | Thread #2 was created |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 617 | at 0x511C08E: clone (in /lib64/libc-2.8.so) |
| 618 | by 0x4E333A4: do_clone (in /lib64/libpthread-2.8.so) |
| 619 | by 0x4E33A30: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.8.so) |
| 620 | by 0x4C299D4: pthread_create@* (hg_intercepts.c:214) |
| 621 | by 0x4008F2: main (tc21_pthonce.c:86) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 622 | |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 623 | Thread #3 was created |
| 624 | at 0x511C08E: clone (in /lib64/libc-2.8.so) |
| 625 | by 0x4E333A4: do_clone (in /lib64/libpthread-2.8.so) |
| 626 | by 0x4E33A30: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.8.so) |
| 627 | by 0x4C299D4: pthread_create@* (hg_intercepts.c:214) |
| 628 | by 0x4008F2: main (tc21_pthonce.c:86) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 629 | |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 630 | Possible data race during read of size 4 at 0x601070 by thread #3 |
sewardj | 70ceabc | 2011-06-24 18:23:42 +0000 | [diff] [blame] | 631 | Locks held: none |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 632 | at 0x40087A: child (tc21_pthonce.c:74) |
| 633 | by 0x4C29AFF: mythread_wrapper (hg_intercepts.c:194) |
| 634 | by 0x4E3403F: start_thread (in /lib64/libpthread-2.8.so) |
| 635 | by 0x511C0CC: clone (in /lib64/libc-2.8.so) |
sewardj | 70ceabc | 2011-06-24 18:23:42 +0000 | [diff] [blame] | 636 | |
| 637 | This conflicts with a previous write of size 4 by thread #2 |
| 638 | Locks held: none |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 639 | at 0x400883: child (tc21_pthonce.c:74) |
| 640 | by 0x4C29AFF: mythread_wrapper (hg_intercepts.c:194) |
| 641 | by 0x4E3403F: start_thread (in /lib64/libpthread-2.8.so) |
| 642 | by 0x511C0CC: clone (in /lib64/libc-2.8.so) |
sewardj | 70ceabc | 2011-06-24 18:23:42 +0000 | [diff] [blame] | 643 | |
| 644 | Location 0x601070 is 0 bytes inside local var "unprotected2" |
| 645 | declared at tc21_pthonce.c:51, in frame #0 of thread 3 |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 646 | ]]></programlisting> |
| 647 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 648 | <para>Helgrind first announces the creation points of any threads |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 649 | referenced in the error message. This is so it can speak concisely |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 650 | about threads without repeatedly printing their creation point call |
| 651 | stacks. Each thread is only ever announced once, the first time it |
| 652 | appears in any Helgrind error message.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 653 | |
| 654 | <para>The main error message begins at the text |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 655 | "<computeroutput>Possible data race during read</computeroutput>". At |
| 656 | the start is information you would expect to see -- address and size |
| 657 | of the racing access, whether a read or a write, and the call stack at |
| 658 | the point it was detected.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 659 | |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 660 | <para>A second call stack is presented starting at the text |
| 661 | "<computeroutput>This conflicts with a previous |
| 662 | write</computeroutput>". This shows a previous access which also |
| 663 | accessed the stated address, and which is believed to be racing |
philippe | 5c165b2 | 2012-07-20 23:40:35 +0000 | [diff] [blame] | 664 | against the access in the first call stack. Note that this second |
| 665 | call stack is limited to a maximum of 8 entries to limit the |
| 666 | memory usage.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 667 | |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 668 | <para>Finally, Helgrind may attempt to give a description of the |
| 669 | raced-on address in source level terms. In this example, it |
| 670 | identifies it as a local variable, shows its name, declaration point, |
| 671 | and in which frame (of the first call stack) it lives. Note that this |
| 672 | information is only shown when <varname>--read-var-info=yes</varname> |
| 673 | is specified on the command line. That's because reading the DWARF3 |
| 674 | debug information in enough detail to capture variable type and |
| 675 | location information makes Helgrind much slower at startup, and also |
| 676 | requires considerable amounts of memory, for large programs. |
| 677 | </para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 678 | |
sewardj | c6a1cd1 | 2008-12-22 00:39:41 +0000 | [diff] [blame] | 679 | <para>Once you have your two call stacks, how do you find the root |
| 680 | cause of the race?</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 681 | |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 682 | <para>The first thing to do is examine the source locations referred |
| 683 | to by each call stack. They should both show an access to the same |
| 684 | location, or variable.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 685 | |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 686 | <para>Now figure out how that location should have been made |
| 687 | thread-safe:</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 688 | |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 689 | <itemizedlist> |
| 690 | <listitem><para>Perhaps the location was intended to be protected by |
| 691 | a mutex? If so, you need to lock and unlock the mutex at both |
| 692 | access points, even if one of the accesses is reported to be a read. |
sewardj | 70ceabc | 2011-06-24 18:23:42 +0000 | [diff] [blame] | 693 | Did you perhaps forget the locking at one or other of the accesses? |
| 694 | To help you do this, Helgrind shows the set of locks held by each |
| 695 | thread at the time it accessed the raced-on location.</para> |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 696 | </listitem> |
sewardj | c6a1cd1 | 2008-12-22 00:39:41 +0000 | [diff] [blame] | 697 | <listitem><para>Alternatively, perhaps you intended to use some |
| 698 | other scheme to make it safe, such as signalling on a condition |
| 699 | variable. In all such cases, try to find a synchronisation event |
| 700 | (or a chain thereof) which separates the earlier-observed access (as |
| 701 | shown in the second call stack) from the later-observed access (as |
| 702 | shown in the first call stack). In other words, try to find |
| 703 | evidence that the earlier access "happens-before" the later access. |
| 704 | See the previous subsection for an explanation of the happens-before |
sewardj | 1a620d5 | 2008-12-23 11:13:07 +0000 | [diff] [blame] | 705 | relation.</para> |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 706 | <para> |
| 707 | The fact that Helgrind is reporting a race means it did not observe |
sewardj | 1a620d5 | 2008-12-23 11:13:07 +0000 | [diff] [blame] | 708 | any happens-before relation between the two accesses. If |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 709 | Helgrind is working correctly, it should also be the case that you |
sewardj | 1a620d5 | 2008-12-23 11:13:07 +0000 | [diff] [blame] | 710 | cannot find any such relation, even on detailed inspection |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 711 | of the source code. Hopefully, though, your inspection of the code |
| 712 | will show where the missing synchronisation operation(s) should have |
| 713 | been.</para> |
| 714 | </listitem> |
| 715 | </itemizedlist> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 716 | |
| 717 | </sect2> |
| 718 | |
| 719 | |
| 720 | </sect1> |
| 721 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 722 | <sect1 id="hg-manual.effective-use" xreflabel="Helgrind Effective Use"> |
| 723 | <title>Hints and Tips for Effective Use of Helgrind</title> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 724 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 725 | <para>Helgrind can be very helpful in finding and resolving |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 726 | threading-related problems. Like all sophisticated tools, it is most |
| 727 | effective when you understand how to play to its strengths.</para> |
| 728 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 729 | <para>Helgrind will be less effective when you merely throw an |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 730 | existing threaded program at it and try to make sense of any reported |
| 731 | errors. It will be more effective if you design threaded programs |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 732 | from the start in a way that helps Helgrind verify correctness. The |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 733 | same is true for finding memory errors with Memcheck, but applies more |
| 734 | here, because thread checking is a harder problem. Consequently it is |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 735 | much easier to write a correct program for which Helgrind falsely |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 736 | reports (threading) errors than it is to write a correct program for |
| 737 | which Memcheck falsely reports (memory) errors.</para> |
| 738 | |
| 739 | <para>With that in mind, here are some tips, listed most important first, |
| 740 | for getting reliable results and avoiding false errors. The first two |
| 741 | are critical. Any violations of them will swamp you with huge numbers |
| 742 | of false data-race errors.</para> |
| 743 | |
| 744 | |
| 745 | <orderedlist> |
| 746 | |
| 747 | <listitem> |
| 748 | <para>Make sure your application, and all the libraries it uses, |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 749 | use the POSIX threading primitives. Helgrind needs to be able to |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 750 | see all events pertaining to thread creation, exit, locking and |
sewardj | 3387889 | 2007-11-17 09:43:25 +0000 | [diff] [blame] | 751 | other synchronisation events. To do so it intercepts many POSIX |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 752 | pthreads functions.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 753 | |
| 754 | <para>Do not roll your own threading primitives (mutexes, etc) |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 755 | from combinations of the Linux futex syscall, atomic counters, etc. |
| 756 | These throw Helgrind's internal what's-going-on models |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 757 | way off course and will give bogus results.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 758 | |
| 759 | <para>Also, do not reimplement existing POSIX abstractions using |
| 760 | other POSIX abstractions. For example, don't build your own |
| 761 | semaphore routines or reader-writer locks from POSIX mutexes and |
| 762 | condition variables. Instead use POSIX reader-writer locks and |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 763 | semaphores directly, since Helgrind supports them directly.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 764 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 765 | <para>Helgrind directly supports the following POSIX threading |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 766 | abstractions: mutexes, reader-writer locks, condition variables |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 767 | (but see below), semaphores and barriers. Currently spinlocks |
| 768 | are not supported, although they could be in future.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 769 | |
| 770 | <para>At the time of writing, the following popular Linux packages |
| 771 | are known to implement their own threading primitives:</para> |
| 772 | |
| 773 | <itemizedlist> |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 774 | <listitem><para>Qt version 4.X. Qt 3.X is harmless in that it |
| 775 | only uses POSIX pthreads primitives. Unfortunately Qt 4.X |
| 776 | has its own implementation of mutexes (QMutex) and thread reaping. |
| 777 | Helgrind 3.4.x contains direct support |
| 778 | for Qt 4.X threading, which is experimental but is believed to |
| 779 | work fairly well. A side effect of supporting Qt 4 directly is |
| 780 | that Helgrind can be used to debug KDE4 applications. As this |
| 781 | is an experimental feature, we would particularly appreciate |
| 782 | feedback from folks who have used Helgrind to successfully debug |
| 783 | Qt 4 and/or KDE4 applications.</para> |
| 784 | </listitem> |
| 785 | <listitem><para>Runtime support library for GNU OpenMP (part of |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 786 | GCC), at least for GCC versions 4.2 and 4.3. The GNU OpenMP runtime |
| 787 | library (<filename>libgomp.so</filename>) constructs its own |
| 788 | synchronisation primitives using combinations of atomic memory |
| 789 | instructions and the futex syscall, which causes total chaos in |
| 790 | Helgrind since it cannot "see" those.</para> |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 791 | <para>Fortunately, this can be solved using a configuration-time |
njn | a331164 | 2009-08-10 01:29:14 +0000 | [diff] [blame] | 792 | option (for GCC). Rebuild GCC from source, and configure using |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 793 | <varname>--disable-linux-futex</varname>. |
| 794 | This makes libgomp.so use the standard |
| 795 | POSIX threading primitives instead. Note that this was tested |
njn | 7316df2 | 2009-08-04 01:16:01 +0000 | [diff] [blame] | 796 | using GCC 4.2.3 and has not been re-tested using more recent GCC |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 797 | versions. We would appreciate hearing about any successes or |
| 798 | failures with more recent versions.</para> |
| 799 | </listitem> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 800 | </itemizedlist> |
sewardj | 70ceabc | 2011-06-24 18:23:42 +0000 | [diff] [blame] | 801 | |
| 802 | <para>If you must implement your own threading primitives, there |
| 803 | is a set of client request macros |
| 804 | in <computeroutput>helgrind.h</computeroutput> to help you |
| 805 | describe your primitives to Helgrind. You should be able to |
| 806 | mark up mutexes, condition variables, etc, without difficulty. |
| 807 | </para> |
| 808 | <para> |
| 809 | It is also possible to mark up the effects of thread-safe |
| 810 | reference counting using the |
| 811 | <computeroutput>ANNOTATE_HAPPENS_BEFORE</computeroutput>, |
| 812 | <computeroutput>ANNOTATE_HAPPENS_AFTER</computeroutput> and |
| 813 | <computeroutput>ANNOTATE_HAPPENS_BEFORE_FORGET_ALL</computeroutput> |
| 814 | macros. Thread-safe reference counting using an atomically |
| 815 | incremented/decremented refcount variable causes Helgrind |
| 816 | problems because a one-to-zero transition of the reference count |
| 817 | means the accessing thread has exclusive ownership of the |
| 818 | associated resource (normally, a C++ object) and can therefore |
| 819 | access it (normally, to run its destructor) without locking. |
| 820 | Helgrind doesn't understand this, and markup is essential to |
| 821 | avoid false positives. |
| 822 | </para> |
| 823 | |
| 824 | <para> |
| 825 | Here are recommended guidelines for marking up thread-safe |
| 826 | reference counting in C++. You only need to mark up your |
| 827 | release methods -- the ones which decrement the reference count. |
| 828 | Given a class like this: |
| 829 | </para> |
| 830 | |
| 831 | <programlisting><![CDATA[ |
| 832 | class MyClass { |
| 833 |    unsigned int mRefCount; |
| 834 | |
| 835 |    void Release ( void ) { |
| 836 |       unsigned int newCount = atomic_decrement(&mRefCount); |
| 837 |       if (newCount == 0) { |
| 838 |          delete this; |
| 839 |       } |
| 840 |    } |
| 841 | }; |
| 842 | ]]></programlisting> |
| 843 | |
| 844 | <para> |
| 845 | the release method should be marked up as follows: |
| 846 | </para> |
| 847 | |
| 848 | <programlisting><![CDATA[ |
| 849 | void Release ( void ) { |
| 850 |    unsigned int newCount = atomic_decrement(&mRefCount); |
| 851 |    if (newCount == 0) { |
| 852 |       ANNOTATE_HAPPENS_AFTER(&mRefCount); |
| 853 |       ANNOTATE_HAPPENS_BEFORE_FORGET_ALL(&mRefCount); |
| 854 |       delete this; |
| 855 |    } else { |
| 856 |       ANNOTATE_HAPPENS_BEFORE(&mRefCount); |
| 857 |    } |
| 858 | } |
| 859 | ]]></programlisting> |
| 860 | |
| 861 | <para> |
| 862 | There are a number of complex, mostly-theoretical objections to |
| 863 | this scheme. From a theoretical standpoint it appears to be |
| 864 | impossible to devise a markup scheme which is completely correct |
| 865 | in the sense of guaranteeing to remove all false races. The |
| 866 | proposed scheme however works well in practice. |
| 867 | </para> |
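<para>For plain C code, the same markup might look like the sketch
below, with C11 atomics standing in for the unspecified
<function>atomic_decrement</function>. The type
<computeroutput>object_t</computeroutput> and the functions
<function>object_new</function>, <function>object_retain</function>
and <function>object_release</function> are hypothetical. The header
path <filename>valgrind/helgrind.h</filename> and the no-op fallback
definitions are assumptions made so that the sketch still compiles on
machines without the Valgrind headers installed.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Use the real macros if the Valgrind headers are installed (assumed
   location); otherwise fall back to no-ops so the sketch still builds. */
#if defined(__has_include)
#  if __has_include(<valgrind/helgrind.h>)
#    include <valgrind/helgrind.h>
#  endif
#endif
#ifndef ANNOTATE_HAPPENS_BEFORE
#  define ANNOTATE_HAPPENS_BEFORE(obj) ((void)(obj))
#endif
#ifndef ANNOTATE_HAPPENS_AFTER
#  define ANNOTATE_HAPPENS_AFTER(obj) ((void)(obj))
#endif
#ifndef ANNOTATE_HAPPENS_BEFORE_FORGET_ALL
#  define ANNOTATE_HAPPENS_BEFORE_FORGET_ALL(obj) ((void)(obj))
#endif

typedef struct {
    atomic_uint refcount;
    int         payload;
} object_t;   /* hypothetical reference-counted object */

object_t *object_new(int payload)
{
    object_t *o = malloc(sizeof *o);
    if (o != NULL) {
        atomic_init(&o->refcount, 1);   /* the creator holds one reference */
        o->payload = payload;
    }
    return o;
}

void object_retain(object_t *o)
{
    atomic_fetch_add(&o->refcount, 1);
}

/* Drops one reference.  Returns 1 if this call freed the object. */
int object_release(object_t *o)
{
    /* atomic_fetch_sub returns the value held before the decrement */
    unsigned int newCount = atomic_fetch_sub(&o->refcount, 1) - 1;

    if (newCount == 0) {
        ANNOTATE_HAPPENS_AFTER(&o->refcount);
        ANNOTATE_HAPPENS_BEFORE_FORGET_ALL(&o->refcount);
        free(o);   /* exclusive owner: no locking needed */
        return 1;
    }
    ANNOTATE_HAPPENS_BEFORE(&o->refcount);
    return 0;
}
```
]]></programlisting>

<para>As in the C++ version, only the release path is marked up. The
address of the reference count serves purely as a tag that links each
earlier release to the final one.</para>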
| 868 | |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 869 | </listitem> |
| 870 | |
| 871 | <listitem> |
| 872 | <para>Avoid memory recycling. If you can't avoid it, you must |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 873 | tell Helgrind what is going on via the |
| 874 | <function>VALGRIND_HG_CLEAN_MEMORY</function> client request (in |
| 875 | <computeroutput>helgrind.h</computeroutput>).</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 876 | |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 877 | <para>Helgrind is aware of standard heap memory allocation and |
| 878 | deallocation that occurs via |
| 879 | <function>malloc</function>/<function>free</function>/<function>new</function>/<function>delete</function> |
| 880 | and from entry and exit of stack frames. In particular, when memory is |
| 881 | deallocated via <function>free</function>, <function>delete</function>, |
| 882 | or function exit, Helgrind considers that memory clean, so when it is |
| 883 | eventually reallocated, its history is irrelevant.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 884 | |
| 885 | <para>However, it is common practice to implement memory recycling |
| 886 | schemes. In these, memory to be freed is not handed to |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 887 | <function>free</function>/<function>delete</function>, but instead put |
| 888 | into a pool of free buffers to be handed out again as required. The |
| 889 | problem is that Helgrind has no |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 890 | way to know that such memory is logically no longer in use, and |
| 891 | its history is irrelevant. Hence you must make that explicit, |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 892 | using the <function>VALGRIND_HG_CLEAN_MEMORY</function> client request |
| 893 | to specify the relevant address ranges. It's easiest to put these |
| 894 | requests into the pool manager code, and use them either when memory is |
| 895 | returned to the pool, or is allocated from it.</para> |
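<para>A minimal sketch of such a pool manager is shown below, with the
client request issued when a buffer is returned to the pool. The
names <function>pool_get</function>, <function>pool_put</function> and
<computeroutput>BUF_SIZE</computeroutput> are hypothetical, the header
path <filename>valgrind/helgrind.h</filename> is assumed, and the
no-op fallback macro exists only so that the sketch compiles without
the Valgrind headers.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Use the real client request if the Valgrind headers are installed
   (assumed location); otherwise fall back to a no-op. */
#if defined(__has_include)
#  if __has_include(<valgrind/helgrind.h>)
#    include <valgrind/helgrind.h>
#  endif
#endif
#ifndef VALGRIND_HG_CLEAN_MEMORY
#  define VALGRIND_HG_CLEAN_MEMORY(start, len) ((void)(start), (void)(len))
#endif

#define BUF_SIZE 256   /* every buffer in this pool has the same size */

/* The free-list link is stored inside the recycled buffer itself. */
typedef struct buf { struct buf *next; } buf_t;

static buf_t          *free_list;
static pthread_mutex_t pool_mx = PTHREAD_MUTEX_INITIALIZER;

/* Hand out a recycled buffer if one is available, else a fresh one. */
void *pool_get(void)
{
    pthread_mutex_lock(&pool_mx);
    buf_t *b = free_list;
    if (b != NULL)
        free_list = b->next;
    pthread_mutex_unlock(&pool_mx);

    return (b != NULL) ? (void *)b : malloc(BUF_SIZE);
}

/* Return a buffer to the pool instead of calling free(). */
void pool_put(void *p)
{
    /* The buffer is logically dead: tell Helgrind that its access
       history is irrelevant before anyone reuses it. */
    VALGRIND_HG_CLEAN_MEMORY(p, BUF_SIZE);

    pthread_mutex_lock(&pool_mx);
    buf_t *b = p;
    b->next = free_list;
    free_list = b;
    pthread_mutex_unlock(&pool_mx);
}
```
]]></programlisting>

<para>Issuing the request in <function>pool_put</function> keeps the
markup in one place. Issuing it in <function>pool_get</function>
instead, just before the buffer is handed out, would serve the same
purpose.</para>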
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 896 | </listitem> |
| 897 | |
| 898 | <listitem> |
| 899 | <para>Avoid POSIX condition variables. If you can, use POSIX |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 900 | semaphores (<function>sem_t</function>, <function>sem_post</function>, |
| 901 | <function>sem_wait</function>) to do inter-thread event signalling. |
| 902 | Semaphores with an initial value of zero are particularly useful for |
| 903 | this.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 904 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 905 | <para>Helgrind only partially correctly handles POSIX condition |
| 906 | variables. This is because Helgrind can see inter-thread |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 907 | dependencies between a <function>pthread_cond_wait</function> call and a |
| 908 | <function>pthread_cond_signal</function>/<function>pthread_cond_broadcast</function> |
| 909 | call only if the waiting thread actually gets to the rendezvous first |
| 910 | (so that it actually calls |
| 911 | <function>pthread_cond_wait</function>). It can't see dependencies |
| 912 | between the threads if the signaller arrives first. In the latter case, |
| 913 | POSIX guidelines imply that the associated boolean condition still |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 914 | provides an inter-thread synchronisation event, but one which is |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 915 | invisible to Helgrind.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 916 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 917 | <para>The result of Helgrind missing some inter-thread |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 918 | synchronisation events is to cause it to report false positives. |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 919 | </para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 920 | |
| 921 | <para>The root cause of this synchronisation lossage is |
| 922 | particularly hard to understand, so an example is helpful. It was |
| 923 | discussed at length by Arndt Muehlenfeld ("Runtime Race Detection |
| 924 | in Multi-Threaded Programs", Dissertation, TU Graz, Austria). The |
| 925 | canonical POSIX-recommended usage scheme for condition variables |
| 926 | is as follows:</para> |
| 927 | |
| 928 | <programlisting><![CDATA[ |
| 929 | b is a Boolean condition, which is False most of the time |
| 930 | cv is a condition variable |
| 931 | mx is its associated mutex |
| 932 | |
| 933 | Signaller: Waiter: |
| 934 | |
| 935 | lock(mx) lock(mx) |
| 936 | b = True while (b == False) |
| 937 | signal(cv) wait(cv,mx) |
| 938 | unlock(mx) unlock(mx) |
| 939 | ]]></programlisting> |
| 940 | |
| 941 | <para>Assume <computeroutput>b</computeroutput> is False most of |
| 942 | the time. If the waiter arrives at the rendezvous first, it |
| 943 | enters its while-loop, waits for the signaller to signal, and |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 944 | eventually proceeds. Helgrind sees the signal, notes the |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 945 | dependency, and all is well.</para> |
| 946 | |
| 947 | <para>If the signaller arrives |
| 948 | first, <computeroutput>b</computeroutput> is set to True, and the |
| 949 | signal disappears into nowhere. When the waiter later arrives, it |
| 950 | does not enter its while-loop and simply carries on. But even in |
| 951 | this case, the waiter code following the while-loop cannot execute |
| 952 | until the signaller sets <computeroutput>b</computeroutput> to |
| 953 | True. Hence there is still the same inter-thread dependency, but |
| 954 | this time it is through an arbitrary in-memory condition, and |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 955 | Helgrind cannot see it.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 956 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 957 | <para>By comparison, Helgrind's detection of inter-thread |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 958 | dependencies caused by semaphore operations is believed to be |
| 959 | exactly correct.</para> |
| 960 | |
| 961 | <para>As far as I know, a solution to this problem that does not |
| 962 | require source-level annotation of condition-variable wait loops |
| 963 | is beyond the current state of the art.</para> |
| 964 | </listitem> |
| 965 | |
| 966 | <listitem> |
| 967 | <para>Make sure you are using a supported Linux distribution. At |
sewardj | 5246990 | 2008-12-21 23:11:14 +0000 | [diff] [blame] | 968 | present, Helgrind only properly supports glibc-2.3 or later. This |
| 969 | in turn means we only support glibc's NPTL threading |
| 970 | implementation. The old LinuxThreads implementation is not |
| 971 | supported.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 972 | </listitem> |
| 973 | |
| 974 | <listitem> |
philippe | 9848690 | 2014-08-19 22:46:44 +0000 | [diff] [blame] | 975 | <para>If your application uses thread-local variables, |
| 976 | Helgrind might report false positive race conditions on these |
| 977 | variables, even though they are very probably race free. On Linux, you can |
| 978 | use <option>--sim-hints=deactivate-pthread-stack-cache-via-hack</option> |
| 979 | to avoid such false positive error messages |
| 980 | (see <xref linkend="opt.sim-hints"/>). |
| 981 | </para> |
| 982 | </listitem> |
| 983 | |
| 984 | <listitem> |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 985 | <para>Round up all finished threads using |
| 986 | <function>pthread_join</function>. Avoid |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 987 | detaching threads: don't create threads in the detached state, and |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 988 | don't call <function>pthread_detach</function> on existing threads.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 989 | |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 990 | <para>Using <function>pthread_join</function> to round up finished |
| 991 | threads provides a clear synchronisation point that both Helgrind and |
| 992 | programmers can see. If you don't call |
| 993 | <function>pthread_join</function> on a thread, Helgrind has no way to |
| 994 | know when it finishes, relative to any |
| 995 | significant synchronisation points for other threads in the program. So |
| 996 | it assumes that the thread lingers indefinitely and can potentially |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 997 | interfere indefinitely with the memory state of the program. It |
| 998 | has every right to assume that -- after all, it might really be |
| 999 | the case that, for scheduling reasons, the exiting thread did run |
| 1000 | very slowly in the last stages of its life.</para> |
| 1001 | </listitem> |
| 1002 | |
| 1003 | <listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1004 | <para>Perform thread debugging (with Helgrind) and memory |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1005 | debugging (with Memcheck) together.</para> |
| 1006 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1007 | <para>Helgrind tracks the state of memory in detail, and memory |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1008 | management bugs in the application are liable to cause confusion. |
| 1009 | In extreme cases, applications which do many invalid reads and |
| 1010 | writes (particularly to freed memory) have been known to crash |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1011 | Helgrind. So, ideally, you should make your application |
| 1012 | Memcheck-clean before using Helgrind.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1013 | |
| 1014 | <para>It may be impossible to make your application Memcheck-clean |
| 1015 | unless you first remove threading bugs. In particular, it may be |
| 1016 | difficult to remove all reads and writes to freed memory in |
| 1017 | multithreaded C++ destructor sequences at program termination. |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1018 | So, ideally, you should make your application Helgrind-clean |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1019 | before using Memcheck.</para> |
| 1020 | |
| 1021 | <para>Since this circularity is obviously unresolvable, at least |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1022 | bear in mind that Memcheck and Helgrind are to some extent |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1023 | complementary, and you may need to use them together.</para> |
| 1024 | </listitem> |
| 1025 | |
| 1026 | <listitem> |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 1027 | <para>POSIX requires that implementations of standard I/O |
| 1028 | (<function>printf</function>, <function>fprintf</function>, |
| 1029 | <function>fwrite</function>, <function>fread</function>, etc) are thread |
| 1030 | safe. Unfortunately GNU libc implements this by using internal locking |
| 1031 | primitives that Helgrind is unable to intercept. Consequently Helgrind |
| 1032 | generates many false race reports when you use these functions.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1033 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1034 | <para>Helgrind attempts to hide these errors using the standard |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1035 | Valgrind error-suppression mechanism. So, at least for simple |
| 1036 | test cases, you don't see any. Nevertheless, some may slip |
| 1037 | through. Just something to be aware of.</para> |
| 1038 | </listitem> |
| 1039 | |
| 1040 | <listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1041 | <para>Helgrind's error checks do not work properly inside the |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1042 | system threading library itself |
| 1043 | (<computeroutput>libpthread.so</computeroutput>), and it usually |
| 1044 | observes large numbers of (false) errors in there. Valgrind's |
| 1045 | suppression system then filters these out, so you should not see |
| 1046 | them.</para> |
| 1047 | |
| 1048 | <para>If you see any race errors reported |
| 1049 | where <computeroutput>libpthread.so</computeroutput> or |
| 1050 | <computeroutput>ld.so</computeroutput> is the object associated |
| 1051 | with the innermost stack frame, please file a bug report at |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 1052 | <ulink url="&vg-url;">&vg-url;</ulink>. |
| 1053 | </para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1054 | </listitem> |
| 1055 | |
| 1056 | </orderedlist> |
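
<para>Two of the hints above, using semaphores for inter-thread event
signalling and rounding up finished threads with
<function>pthread_join</function>, can be sketched in C as follows. This is
only a minimal illustration under stated assumptions, not code taken from
Valgrind: the names <computeroutput>handoff_t</computeroutput>,
<computeroutput>producer</computeroutput> and
<computeroutput>run_handoff</computeroutput>, and the value 42, are
hypothetical, and unnamed semaphores created with
<function>sem_init</function> are assumed to be available (as they are on
Linux).</para>

<programlisting><![CDATA[
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

/* Hypothetical example: hand one value from a producer thread to its
   creator, using a semaphore with initial value zero as the event. */
typedef struct {
    sem_t ready;    /* posted once 'payload' has been written */
    int   payload;  /* shared data, ordered by the post/wait pair */
} handoff_t;

static void *producer(void *arg)
{
    handoff_t *h = arg;
    h->payload = 42;        /* write the shared data first ... */
    sem_post(&h->ready);    /* ... then signal that it is available */
    return NULL;
}

/* Returns the value received from the producer, or -1 on error. */
int run_handoff(void)
{
    handoff_t h;
    pthread_t tid;
    int seen;

    if (sem_init(&h.ready, 0, 0) != 0)      /* initial value zero */
        return -1;
    if (pthread_create(&tid, NULL, producer, &h) != 0) {
        sem_destroy(&h.ready);
        return -1;
    }

    /* Wait for the event; retry if interrupted by a signal handler. */
    while (sem_wait(&h.ready) == -1 && errno == EINTR)
        continue;
    seen = h.payload;                       /* ordered after the write */

    pthread_join(tid, NULL);                /* round up the finished thread */
    sem_destroy(&h.ready);
    return seen;
}
]]></programlisting>

<para>If <computeroutput>run_handoff</computeroutput> is called from
<function>main</function> and the program is run with
<option>--tool=helgrind</option>, the
<function>sem_post</function>/<function>sem_wait</function> pair and the
<function>pthread_join</function> call are both synchronisation events that
Helgrind can observe, so the access to
<computeroutput>payload</computeroutput> should not be reported as a
race.</para>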
| 1057 | |
| 1058 | </sect1> |
| 1059 | |
| 1060 | |
| 1061 | |
| 1062 | |
njn | a331164 | 2009-08-10 01:29:14 +0000 | [diff] [blame] | 1063 | <sect1 id="hg-manual.options" xreflabel="Helgrind Command-line Options"> |
| 1064 | <title>Helgrind Command-line Options</title> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1065 | |
| 1066 | <para>The following end-user options are available:</para> |
| 1067 | |
| 1068 | <!-- start of xi:include in the manpage --> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1069 | <variablelist id="hg.opts.list"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1070 | |
sewardj | 622fe49 | 2011-03-11 21:06:59 +0000 | [diff] [blame] | 1071 | <varlistentry id="opt.free-is-write" |
| 1072 | xreflabel="--free-is-write"> |
| 1073 | <term> |
| 1074 | <option><![CDATA[--free-is-write=no|yes |
| 1075 | [default: no] ]]></option> |
| 1076 | </term> |
| 1077 | <listitem> |
| 1078 | <para>When enabled (not the default), Helgrind treats freeing of |
| 1079 | heap memory as if the memory was written immediately before |
| 1080 | the free. This exposes races where memory is referenced by |
| 1081 | one thread, and freed by another, but there is no observable |
| 1082 | synchronisation event to ensure that the reference happens |
| 1083 | before the free. |
| 1084 | </para> |
| 1085 | <para>This functionality is new in Valgrind 3.7.0, and is |
| 1086 | regarded as experimental. It is not enabled by default |
| 1087 | because its interaction with custom memory allocators is not |
| 1088 | well understood at present. User feedback is welcomed. |
| 1089 | </para> |
| 1090 | </listitem> |
| 1091 | </varlistentry> |
| 1092 | |
sewardj | c6a1cd1 | 2008-12-22 00:39:41 +0000 | [diff] [blame] | 1093 | <varlistentry id="opt.track-lockorders" |
| 1094 | xreflabel="--track-lockorders"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1095 | <term> |
sewardj | c6a1cd1 | 2008-12-22 00:39:41 +0000 | [diff] [blame] | 1096 | <option><![CDATA[--track-lockorders=no|yes |
| 1097 | [default: yes] ]]></option> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1098 | </term> |
| 1099 | <listitem> |
sewardj | c6a1cd1 | 2008-12-22 00:39:41 +0000 | [diff] [blame] | 1100 | <para>When enabled (the default), Helgrind performs lock order |
| 1101 | consistency checking. For some buggy programs, the large number |
| 1102 | of lock order errors reported can become annoying, particularly |
| 1103 | if you're only interested in race errors. You may therefore find |
| 1104 | it helpful to disable lock order checking.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1105 | </listitem> |
| 1106 | </varlistentry> |
| 1107 | |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 1108 | <varlistentry id="opt.history-level" |
| 1109 | xreflabel="--history-level"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1110 | <term> |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 1111 | <option><![CDATA[--history-level=none|approx|full |
| 1112 | [default: full] ]]></option> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1113 | </term> |
| 1114 | <listitem> |
sewardj | 3d49844 | 2009-08-16 22:47:02 +0000 | [diff] [blame] | 1115 | <para><option>--history-level=full</option> (the default) causes |
| 1116 | Helgrind to collect enough information about "old" accesses that |
| 1117 | it can produce two stack traces in a race report -- both the |
| 1118 | stack trace for the current access, and the trace for the |
philippe | 5c165b2 | 2012-07-20 23:40:35 +0000 | [diff] [blame] | 1119 | older, conflicting access. To limit memory usage, stack traces for |
| 1120 | "old" accesses are limited to a maximum of 8 entries, even if the |
| 1121 | <option>--num-callers</option> value is larger.</para> |
sewardj | c6a1cd1 | 2008-12-22 00:39:41 +0000 | [diff] [blame] | 1122 | <para>Collecting such information is expensive in both speed and |
sewardj | 3d49844 | 2009-08-16 22:47:02 +0000 | [diff] [blame] | 1123 | memory, particularly for programs that do many inter-thread |
| 1124 | synchronisation events (locks, unlocks, etc). Without such |
| 1125 | information, it is more difficult to track down the root |
| 1126 | causes of races. Nonetheless, you may not need it in |
| 1127 | situations where you just want to check for the presence or |
| 1128 | absence of races, for example, when doing regression testing |
| 1129 | of a previously race-free program.</para> |
| 1130 | <para><option>--history-level=none</option> is the opposite |
| 1131 | extreme. It causes Helgrind not to collect any information |
| 1132 | about previous accesses. This can be dramatically faster |
| 1133 | than <option>--history-level=full</option>.</para> |
| 1134 | <para><option>--history-level=approx</option> provides a |
| 1135 | compromise between these two extremes. It causes Helgrind to |
| 1136 | show a full trace for the later access, and approximate |
| 1137 | information regarding the earlier access. This approximate |
| 1138 | information consists of two stacks, and the earlier access is |
| 1139 | guaranteed to have occurred somewhere between program points |
| 1140 | denoted by the two stacks. This is not as useful as showing |
| 1141 | the exact stack for the previous access |
| 1142 | (as <option>--history-level=full</option> does), but it is |
| 1143 | better than nothing, and it is almost as fast as |
| 1144 | <option>--history-level=none</option>.</para> |
sewardj | c6a1cd1 | 2008-12-22 00:39:41 +0000 | [diff] [blame] | 1145 | </listitem> |
| 1146 | </varlistentry> |
| 1147 | |
| 1148 | <varlistentry id="opt.conflict-cache-size" |
| 1149 | xreflabel="--conflict-cache-size"> |
| 1150 | <term> |
| 1151 | <option><![CDATA[--conflict-cache-size=N |
| 1152 | [default: 1000000] ]]></option> |
| 1153 | </term> |
| 1154 | <listitem> |
sewardj | 3d49844 | 2009-08-16 22:47:02 +0000 | [diff] [blame] | 1155 | <para>This flag only has any effect |
| 1156 | at <option>--history-level=full</option>.</para> |
sewardj | c6a1cd1 | 2008-12-22 00:39:41 +0000 | [diff] [blame] | 1157 | <para>Information about "old" conflicting accesses is stored in |
| 1158 | a cache of limited size, with LRU-style management. This is |
| 1159 | necessary because it isn't practical to store a stack trace |
| 1160 | for every single memory access made by the program. |
| 1161 | Historical information on locations that have not been accessed |
| 1162 | recently is periodically discarded, to free up space in the cache.</para> |
njn | a331164 | 2009-08-10 01:29:14 +0000 | [diff] [blame] | 1163 | <para>This option controls the size of the cache, in terms of the |
sewardj | c6a1cd1 | 2008-12-22 00:39:41 +0000 | [diff] [blame] | 1164 | number of different memory addresses for which |
| 1165 | conflicting access information is stored. If you find that |
| 1166 | Helgrind is showing race errors with only one stack instead of |
| 1167 | the expected two stacks, try increasing this value.</para> |
sewardj | 3d49844 | 2009-08-16 22:47:02 +0000 | [diff] [blame] | 1168 | <para>The minimum value is 10,000 and the maximum is 30,000,000 |
| 1169 | (thirty times the default value). Increasing the value by 1 |
sewardj | c6a1cd1 | 2008-12-22 00:39:41 +0000 | [diff] [blame] | 1170 | increases Helgrind's memory requirement by very roughly 100 |
sewardj | 3d49844 | 2009-08-16 22:47:02 +0000 | [diff] [blame] | 1171 | bytes, so the maximum value will easily eat up three extra |
sewardj | 78bb7f6 | 2009-08-14 21:33:34 +0000 | [diff] [blame] | 1172 | gigabytes or so of memory.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1173 | </listitem> |
| 1174 | </varlistentry> |
| 1175 | |
sewardj | 70ceabc | 2011-06-24 18:23:42 +0000 | [diff] [blame] | 1176 | <varlistentry id="opt.check-stack-refs" |
| 1177 | xreflabel="--check-stack-refs"> |
| 1178 | <term> |
| 1179 | <option><![CDATA[--check-stack-refs=no|yes |
| 1180 | [default: yes] ]]></option> |
| 1181 | </term> |
| 1182 | <listitem> |
| 1183 | <para> |
| 1184 | By default Helgrind checks all data memory accesses made by your |
| 1185 | program. This flag enables you to skip checking for accesses |
| 1186 | to thread stacks (local variables). This can improve |
| 1187 | performance, but comes at the cost of missing races on |
| 1188 | stack-allocated data. |
| 1189 | </para> |
| 1190 | </listitem> |
| 1191 | </varlistentry> |
| 1192 | |
sewardj | 8eb8bab | 2015-07-21 14:44:28 +0000 | [diff] [blame] | 1193 | <varlistentry id="opt.ignore-thread-creation" |
| 1194 | xreflabel="--ignore-thread-creation"> |
| 1195 | <term> |
| 1196 | <option><![CDATA[--ignore-thread-creation=<yes|no> |
| 1197 | [default: no]]]></option> |
| 1198 | </term> |
| 1199 | <listitem> |
| 1200 | <para> |
| 1201 | Controls whether all activities during thread creation should be |
| 1202 | ignored. This option is enabled by default only on Solaris. |
| 1203 | Solaris provides higher throughput, parallelism and scalability than |
| 1204 | other operating systems, at the cost of more fine-grained locking |
| 1205 | activity. This means for example that when a thread is created under |
| 1206 | glibc, just one big lock is used for all thread setup. Solaris libc |
| 1207 | uses several fine-grained locks and the creator thread resumes its |
| 1208 | activities as soon as possible, leaving, for example, the stack and TLS setup |
| 1209 | sequence to the created thread. |
| 1210 | This situation confuses Helgrind, because it assumes there is some false |
| 1211 | ordering in place between the creator and the created thread; many |
| 1212 | types of race conditions in the application would therefore not be reported. |
| 1213 | To prevent such false ordering, this command line option is set to |
| 1214 | <computeroutput>yes</computeroutput> by default on Solaris. |
| 1215 | All activity (loads, stores, client requests) is therefore ignored |
| 1216 | during:</para> |
| 1217 | <itemizedlist> |
| 1218 | <listitem> |
| 1219 | <para> |
| 1220 | pthread_create() call in the creator thread |
| 1221 | </para> |
| 1222 | </listitem> |
| 1223 | <listitem> |
| 1224 | <para> |
| 1225 | thread creation phase (stack and TLS setup) in the created thread |
| 1226 | </para> |
| 1227 | </listitem> |
| 1228 | </itemizedlist> |
| 1229 | <para> |
| 1230 | Also, new memory allocated during thread creation is untracked; |
| 1231 | that is, race reporting is suppressed there. DRD does the same thing |
| 1232 | implicitly. This is necessary because Solaris libc caches many objects |
| 1233 | and reuses them for different threads and that confuses |
| 1234 | Helgrind.</para> |
| 1235 | </listitem> |
| 1236 | </varlistentry> |
| 1237 | |
sewardj | 70ceabc | 2011-06-24 18:23:42 +0000 | [diff] [blame] | 1238 | |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1239 | </variablelist> |
| 1240 | <!-- end of xi:include in the manpage --> |
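
<para>As a rough illustration of how the options above are combined on the
command line, the following invocations use only options documented in this
section. The program name <computeroutput>./myprog</computeroutput> is a
placeholder, and the memory figure in the comment is just the "roughly 100
bytes per entry" estimate given above applied to an example value.</para>

<programlisting><![CDATA[
# Fast regression check: keep no information about earlier accesses
valgrind --tool=helgrind --history-level=none ./myprog

# Compromise: approximate information about the earlier access
valgrind --tool=helgrind --history-level=approx ./myprog

# Full history with a larger conflict cache: 5,000,000 entries at very
# roughly 100 bytes each is about 500 MB, some 400 MB more than the default
valgrind --tool=helgrind --history-level=full --conflict-cache-size=5000000 ./myprog

# Race errors only: lock order checking disabled
valgrind --tool=helgrind --track-lockorders=no ./myprog

# Treat free() as a write, and skip checking of accesses to thread stacks
valgrind --tool=helgrind --free-is-write=yes --check-stack-refs=no ./myprog
]]></programlisting>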
| 1241 | |
| 1242 | <!-- start of xi:include in the manpage --> |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 1243 | <!-- commented out, because we don't document debugging options in the |
| 1244 | manual. Nb: all the double-dashes below had a space inserted in them |
| 1245 | to avoid problems with premature closing of this comment. |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1246 | <para>In addition, the following debugging options are available for |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1247 | Helgrind:</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1248 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1249 | <variablelist id="hg.debugopts.list"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1250 | |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 1251 | <varlistentry id="opt.trace-malloc" xreflabel="- -trace-malloc"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1252 | <term> |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 1253 | <option><![CDATA[- -trace-malloc=no|yes [no] |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1254 | ]]></option> |
| 1255 | </term> |
| 1256 | <listitem> |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 1257 | <para>Show all client <function>malloc</function> (etc) and |
| 1258 | <function>free</function> (etc) requests.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1259 | </listitem> |
| 1260 | </varlistentry> |
| 1261 | |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1262 | <varlistentry id="opt.cmp-race-err-addrs" |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 1263 | xreflabel="- -cmp-race-err-addrs"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1264 | <term> |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 1265 | <option><![CDATA[- -cmp-race-err-addrs=no|yes [no] |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1266 | ]]></option> |
| 1267 | </term> |
| 1268 | <listitem> |
| 1269 | <para>Controls whether or not race (data) addresses should be |
| 1270 | taken into account when removing duplicates of race errors. |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 1271 | With <varname>- -cmp-race-err-addrs=no</varname>, two otherwise |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1272 | identical race errors will be considered to be the same even if |
| 1273 | their race addresses differ. |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 1274 | With <varname>- -cmp-race-err-addrs=yes</varname> they will be |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1275 | considered different. This is provided to help make certain |
| 1276 | regression tests work reliably.</para> |
| 1277 | </listitem> |
| 1278 | </varlistentry> |
| 1279 | |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 1280 | <varlistentry id="opt.hg-sanity-flags" xreflabel="- -hg-sanity-flags"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1281 | <term> |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 1282 | <option><![CDATA[- -hg-sanity-flags=<XXXXXX> (X = 0|1) [000000] |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1283 | ]]></option> |
| 1284 | </term> |
| 1285 | <listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1286 | <para>Run extensive sanity checks on Helgrind's internal |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1287 | data structures at events defined by the bitstring, as |
| 1288 | follows:</para> |
sewardj | 11e352f | 2007-11-30 11:11:02 +0000 | [diff] [blame] | 1289 | <para><computeroutput>010000 </computeroutput>after changes to |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1290 | the lock order acquisition graph</para> |
sewardj | 11e352f | 2007-11-30 11:11:02 +0000 | [diff] [blame] | 1291 | <para><computeroutput>001000 </computeroutput>after every client |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1292 | memory access (NB: not currently used)</para> |
sewardj | 11e352f | 2007-11-30 11:11:02 +0000 | [diff] [blame] | 1293 | <para><computeroutput>000100 </computeroutput>after every client |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1294 | memory range permission setting of 256 bytes or greater</para> |
sewardj | 11e352f | 2007-11-30 11:11:02 +0000 | [diff] [blame] | 1295 | <para><computeroutput>000010 </computeroutput>after every client |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1296 | lock or unlock event</para> |
sewardj | 11e352f | 2007-11-30 11:11:02 +0000 | [diff] [blame] | 1297 | <para><computeroutput>000001 </computeroutput>after every client |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1298 | thread creation or join event</para> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1299 | <para>Note these will make Helgrind run very slowly, often to |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1300 | the point of being completely unusable.</para> |
| 1301 | </listitem> |
| 1302 | </varlistentry> |
| 1303 | |
| 1304 | </variablelist> |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 1305 | --> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1306 | <!-- end of xi:include in the manpage --> |
| 1307 | |
| 1308 | |
| 1309 | </sect1> |
| 1310 | |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 1311 | |
philippe | f577434 | 2014-05-03 11:12:50 +0000 | [diff] [blame] | 1312 | <sect1 id="hg-manual.monitor-commands" xreflabel="Helgrind Monitor Commands"> |
| 1313 | <title>Helgrind Monitor Commands</title> |
| 1314 | <para>The Helgrind tool provides monitor commands handled by Valgrind's |
| 1315 | built-in gdbserver (see <xref linkend="manual-core-adv.gdbserver-commandhandling"/>). |
| 1316 | </para> |
| 1317 | <itemizedlist> |
| 1318 | <listitem> |
philippe | 328d662 | 2015-05-25 17:24:27 +0000 | [diff] [blame] | 1319 | <para><varname>info locks [lock_addr]</varname> shows the list of locks |
| 1320 | and their status. If <varname>lock_addr</varname> is given, only shows |
| 1321 | the lock located at this address. </para> |
philippe | f577434 | 2014-05-03 11:12:50 +0000 | [diff] [blame] | 1322 | <para> |
philippe | 328d662 | 2015-05-25 17:24:27 +0000 | [diff] [blame] | 1323 | In the following example, Helgrind knows about one lock. This |
| 1324 | lock is located at the guest address <varname>ga |
| 1325 | 0x8049a20</varname>. The lock kind is <varname>rdwr</varname> |
| 1326 | indicating a reader-writer lock. Other possible lock kinds |
| 1327 | are <varname>nonRec</varname> (simple mutex, non recursive) |
philippe | f577434 | 2014-05-03 11:12:50 +0000 | [diff] [blame] | 1328 | and <varname>mbRec</varname> (simple mutex, possibly recursive). |
philippe | 328d662 | 2015-05-25 17:24:27 +0000 | [diff] [blame] | 1329 | The lock kind is then followed by the list of threads holding the |
| 1330 | lock. In the example below, <varname>R1:thread #6 tid 3</varname> |
| 1331 | indicates that Helgrind thread #6 has acquired (once, as the |
| 1332 | counter following the letter R is one) the lock in read mode. The |
| 1333 | Helgrind thread number is incremented for each started thread. The |
| 1334 | presence of 'tid 3' indicates that thread #6 has not exited |
| 1335 | yet and is the Valgrind tid 3. If a thread has terminated, then |
| 1336 | this is indicated with 'tid (exited)'. |
philippe | f577434 | 2014-05-03 11:12:50 +0000 | [diff] [blame] | 1337 | </para> |
| 1338 | <programlisting><![CDATA[ |
| 1339 | (gdb) monitor info locks |
| 1340 | Lock ga 0x8049a20 { |
| 1341 | kind rdwr |
| 1342 | { R1:thread #6 tid 3 } |
| 1343 | } |
| 1344 | (gdb) |
| 1345 | ]]></programlisting> |
| 1346 | |
philippe | 328d662 | 2015-05-25 17:24:27 +0000 | [diff] [blame] | 1347 | <para> If you give the option <varname>--read-var-info=yes</varname>, |
| 1348 | then more information will be provided about the lock location, such as |
| 1349 | the global variable or the heap block that contains the lock: |
philippe | f577434 | 2014-05-03 11:12:50 +0000 | [diff] [blame] | 1350 | </para> |
| 1351 | <programlisting><![CDATA[ |
| 1352 | Lock ga 0x8049a20 { |
philippe | 07c0852 | 2014-05-14 20:39:27 +0000 | [diff] [blame] | 1353 | Location 0x8049a20 is 0 bytes inside global var "s_rwlock" |
| 1354 | declared at rwlock_race.c:17 |
philippe | f577434 | 2014-05-03 11:12:50 +0000 | [diff] [blame] | 1355 | kind rdwr |
| 1356 | { R1:thread #3 tid 3 } |
| 1357 | } |
| 1358 | ]]></programlisting> |
| 1359 | |
| 1360 | </listitem> |
| 1361 | |
philippe | 328d662 | 2015-05-25 17:24:27 +0000 | [diff] [blame] | 1362 | <listitem> |
| 1363 | <para><varname>accesshistory &lt;addr&gt; [&lt;len&gt;]</varname> |
| 1364 | shows the access history recorded for &lt;len&gt; (default 1) bytes |
| 1365 | starting at &lt;addr&gt;. For each recorded access that overlaps |
| 1366 | with the given range, <varname>accesshistory</varname> shows the operation |
| 1367 | type (read or write), the address and size read or written, the Helgrind |
| 1368 | thread number and Valgrind tid of the thread that did the operation, and the locks held |
| 1369 | by the thread at the time of the operation. |
| 1370 | The oldest access is shown first, the most recent access is shown last. |
| 1371 | </para> |
| 1372 | <para> |
| 1373 | In the following example, we first see a recorded write of 4 bytes by |
| 1374 | thread #7 that modified the given 2-byte range. |
| 1375 | The second recorded write is the most recent one: thread #9 |
| 1376 | modified the same 2 bytes as part of a 4-byte write operation. |
| 1377 | The list of locks held by each thread at the time of the write operation |
| 1378 | is also shown. |
| 1379 | </para> |
| 1380 | <programlisting><![CDATA[ |
| 1381 | (gdb) monitor accesshistory 0x8049D8A 2 |
| 1382 | write of size 4 at 0x8049D88 by thread #7 tid 3 |
| 1383 | ==6319== Locks held: 2, at address 0x8049D8C (and 1 that can't be shown) |
| 1384 | ==6319== at 0x804865F: child_fn1 (locked_vs_unlocked2.c:29) |
| 1385 | ==6319== by 0x400AE61: mythread_wrapper (hg_intercepts.c:234) |
| 1386 | ==6319== by 0x39B924: start_thread (pthread_create.c:297) |
| 1387 | ==6319== by 0x2F107D: clone (clone.S:130) |
| 1388 | |
| 1389 | write of size 4 at 0x8049D88 by thread #9 tid 2 |
| 1390 | ==6319== Locks held: 2, at addresses 0x8049DA4 0x8049DD4 |
| 1391 | ==6319== at 0x804877B: child_fn2 (locked_vs_unlocked2.c:45) |
| 1392 | ==6319== by 0x400AE61: mythread_wrapper (hg_intercepts.c:234) |
| 1393 | ==6319== by 0x39B924: start_thread (pthread_create.c:297) |
| 1394 | ==6319== by 0x2F107D: clone (clone.S:130) |
| 1395 | |
| 1396 | ]]></programlisting> |
| 1397 | |
| 1398 | </listitem> |
| 1399 | |
philippe | f577434 | 2014-05-03 11:12:50 +0000 | [diff] [blame] | 1400 | </itemizedlist> |
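
<para>To issue the monitor commands described above, the program must be
running under Valgrind's gdbserver with GDB attached. A typical session might
look roughly like the sketch below, where
<computeroutput>./myprog</computeroutput> is a placeholder for your own
program; the gdbserver section referenced at the start of this section
describes the connection options in detail.</para>

<programlisting><![CDATA[
# Terminal 1: start the program under Helgrind and wait for GDB to attach
valgrind --tool=helgrind --vgdb=yes --vgdb-error=0 ./myprog

# Terminal 2: attach GDB through the vgdb relay, then query Helgrind
gdb ./myprog
(gdb) target remote | vgdb
(gdb) monitor info locks
(gdb) continue
]]></programlisting>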
| 1401 | |
| 1402 | </sect1> |
njn | f6e8ca9 | 2009-08-07 02:18:00 +0000 | [diff] [blame] | 1403 | |
<sect1 id="hg-manual.client-requests" xreflabel="Helgrind Client Requests">
<title>Helgrind Client Requests</title>

<para>The following client requests are defined in
<filename>helgrind.h</filename>.  See that file for exact details of their
arguments.</para>

<itemizedlist>

  <listitem>
    <para><function>VALGRIND_HG_CLEAN_MEMORY</function></para>
    <para>This makes Helgrind forget everything it knows about a
    specified memory range.  This is particularly useful for memory
    allocators that wish to recycle memory.</para>
  </listitem>
  <listitem>
    <para><function>ANNOTATE_HAPPENS_BEFORE</function></para>
  </listitem>
  <listitem>
    <para><function>ANNOTATE_HAPPENS_AFTER</function></para>
  </listitem>
  <listitem>
    <para><function>ANNOTATE_NEW_MEMORY</function></para>
  </listitem>
  <listitem>
    <para><function>ANNOTATE_RWLOCK_CREATE</function></para>
  </listitem>
  <listitem>
    <para><function>ANNOTATE_RWLOCK_DESTROY</function></para>
  </listitem>
  <listitem>
    <para><function>ANNOTATE_RWLOCK_ACQUIRED</function></para>
  </listitem>
  <listitem>
    <para><function>ANNOTATE_RWLOCK_RELEASED</function></para>
    <para>The <function>ANNOTATE_</function> requests listed above are
    used to describe to Helgrind the behaviour of custom (non-POSIX)
    synchronisation primitives, which it otherwise has no way to
    understand.  See the comments in <filename>helgrind.h</filename>
    for further documentation.</para>
  </listitem>

</itemizedlist>
</sect1>



<sect1 id="hg-manual.todolist" xreflabel="To Do List">
<title>A To-Do List for Helgrind</title>

<para>The following is a list of loose ends which should be tidied up
some time.</para>

<itemizedlist>
  <listitem><para>For lock order errors, print the complete lock
    cycle, rather than only doing so for size-2 cycles as at
    present.</para>
  </listitem>
  <listitem><para>The conflicting access mechanism sometimes
    mysteriously fails to show the conflicting access's stack, even
    when provided with unbounded storage for conflicting access info.
    This should be investigated.</para>
  </listitem>
  <listitem><para>Document races caused by GCC's thread-unsafe code
    generation for speculative stores.  In the interim see
    <computeroutput>http://gcc.gnu.org/ml/gcc/2007-10/msg00266.html</computeroutput>
    and <computeroutput>http://lkml.org/lkml/2007/10/24/673</computeroutput>.
    </para>
  </listitem>
  <listitem><para>Don't update the lock-order graph, and don't check
    for errors, when a "try"-style lock operation happens (e.g.
    <function>pthread_mutex_trylock</function>).  Such calls do not add any real
    restrictions to the locking order, since they can always fail to
    acquire the lock, resulting in the caller going off and doing Plan
    B (presumably it will have a Plan B).  Doing such checks could
    generate false lock-order errors and confuse users.</para>
  </listitem>
  <listitem><para>Performance can be very poor.  Slowdowns on the
    order of 100:1 are not unusual.  There is limited scope for
    performance improvements.
    </para>
  </listitem>

</itemizedlist>

</sect1>

</chapter>