sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1 | <?xml version="1.0"?> <!-- -*- sgml -*- --> |
| 2 | <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN" |
| 3 | "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"> |
| 4 | |
| 5 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 6 | <chapter id="hg-manual" xreflabel="Helgrind: thread error detector"> |
| 7 | <title>Helgrind: a thread error detector</title> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 8 | |
| 9 | <para>To use this tool, you must specify |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 10 | <computeroutput>--tool=helgrind</computeroutput> on the Valgrind |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 11 | command line.</para> |
| 12 | |
| 13 | |
| 14 | |
| 15 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 16 | <sect1 id="hg-manual.overview" xreflabel="Overview"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 17 | <title>Overview</title> |
| 18 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 19 | <para>Helgrind is a Valgrind tool for detecting synchronisation errors |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 20 | in C, C++ and Fortran programs that use the POSIX pthreads |
| 21 | threading primitives.</para> |
| 22 | |
| 23 | <para>The main abstractions in POSIX pthreads are: a set of threads |
| 24 | sharing a common address space, thread creation, thread joining, |
| 25 | thread exit, mutexes (locks), condition variables (inter-thread event |
| 26 | notifications), reader-writer locks, and semaphores.</para> |
| 27 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 28 | <para>Helgrind is aware of all these abstractions and tracks their |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 29 | effects as accurately as it can. Currently it does not correctly |
| 30 | handle pthread barriers and pthread spinlocks, although it will not |
| 31 | object if you use them. On x86 and amd64 platforms, it understands |
| 32 | and partially handles implicit locking arising from the use of the |
| 33 | LOCK instruction prefix. |
| 34 | </para> |
| 35 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 36 | <para>Helgrind can detect three classes of errors, which are discussed |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 37 | in detail in the next three sections:</para> |
| 38 | |
| 39 | <orderedlist> |
| 40 | <listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 41 | <para><link linkend="hg-manual.api-checks"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 42 | Misuses of the POSIX pthreads API.</link></para> |
| 43 | </listitem> |
| 44 | <listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 45 | <para><link linkend="hg-manual.lock-orders"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 46 | Potential deadlocks arising from lock |
| 47 | ordering problems.</link></para> |
| 48 | </listitem> |
| 49 | <listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 50 | <para><link linkend="hg-manual.data-races"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 51 | Data races -- accessing memory without adequate locking. |
| 52 | </link></para> |
| 53 | </listitem> |
| 54 | </orderedlist> |
| 55 | |
| 56 | <para>Following those is a section containing |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 57 | <link linkend="hg-manual.effective-use"> |
| 58 | hints and tips on how to get the best out of Helgrind.</link> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 59 | </para> |
| 60 | |
| 61 | <para>Then there is a |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 62 | <link linkend="hg-manual.options">summary of command-line |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 63 | options.</link> |
| 64 | </para> |
| 65 | |
| 66 | <para>Finally, there is |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 67 | <link linkend="hg-manual.todolist">a brief summary of areas in which Helgrind |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 68 | could be improved.</link> |
| 69 | </para> |
| 70 | |
| 71 | </sect1> |
| 72 | |
| 73 | |
| 74 | |
| 75 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 76 | <sect1 id="hg-manual.api-checks" xreflabel="API Checks"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 77 | <title>Detected errors: Misuses of the POSIX pthreads API</title> |
| 78 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 79 | <para>Helgrind intercepts calls to many POSIX pthreads functions, and |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 80 | is therefore able to report on various common problems. Although |
| 81 | these are unglamorous errors, their presence can lead to undefined |
| 82 | program behaviour and hard-to-find bugs later in execution. The |
| 83 | detected errors are:</para> |
| 84 | |
| 85 | <itemizedlist> |
| 86 | <listitem><para>unlocking an invalid mutex</para></listitem> |
| 87 | <listitem><para>unlocking a not-locked mutex</para></listitem> |
| 88 | <listitem><para>unlocking a mutex held by a different |
| 89 | thread</para></listitem> |
| 90 | <listitem><para>destroying an invalid or a locked mutex</para></listitem> |
| 91 | <listitem><para>recursively locking a non-recursive mutex</para></listitem> |
| 92 | <listitem><para>deallocation of memory that contains a |
| 93 | locked mutex</para></listitem> |
| 94 | <listitem><para>passing mutex arguments to functions expecting |
| 95 | reader-writer lock arguments, and vice |
| 96 | versa</para></listitem> |
| 97 | <listitem><para>when a POSIX pthread function fails with an |
| 98 | error code that must be handled</para></listitem> |
| 99 | <listitem><para>when a thread exits whilst still holding locked |
| 100 | locks</para></listitem> |
| 101 | <listitem><para>calling <computeroutput>pthread_cond_wait</computeroutput> |
| 102 | with a not-locked mutex, or one locked by a different |
| 103 | thread</para></listitem> |
| 104 | </itemizedlist> |
| 105 | |
| 106 | <para>Checks pertaining to the validity of mutexes are generally also |
| 107 | performed for reader-writer locks.</para> |
| 108 | |
| 109 | <para>Various kinds of this-can't-possibly-happen events are also |
| 110 | reported. These usually indicate bugs in the system threading |
| 111 | library.</para> |
| 112 | |
| 113 | <para>Reported errors always contain a primary stack trace indicating |
| 114 | where the error was detected. They may also contain auxiliary stack |
| 115 | traces giving additional information. In particular, most errors |
| 116 | relating to mutexes will also tell you where that mutex first came to |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 117 | Helgrind's attention (the "<computeroutput>was first observed |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 118 | at</computeroutput>" part), so you have a chance of figuring out which |
| 119 | mutex it is referring to. For example:</para> |
| 120 | |
| 121 | <programlisting><![CDATA[ |
| 122 | Thread #1 unlocked a not-locked lock at 0x7FEFFFA90 |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 123 | at 0x4C2408D: pthread_mutex_unlock (hg_intercepts.c:492) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 124 | by 0x40073A: nearly_main (tc09_bad_unlock.c:27) |
| 125 | by 0x40079B: main (tc09_bad_unlock.c:50) |
| 126 | Lock at 0x7FEFFFA90 was first observed |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 127 | at 0x4C25D01: pthread_mutex_init (hg_intercepts.c:326) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 128 | by 0x40071F: nearly_main (tc09_bad_unlock.c:23) |
| 129 | by 0x40079B: main (tc09_bad_unlock.c:50) |
| 130 | ]]></programlisting> |
| 131 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 132 | <para>Helgrind has a way of summarising thread identities, as |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 133 | evidenced here by the text "<computeroutput>Thread |
| 134 | #1</computeroutput>". This is so that it can speak about threads and |
| 135 | sets of threads without overwhelming you with details. See |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 136 | <link linkend="hg-manual.data-races.errmsgs">below</link> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 137 | for more information on interpreting error messages.</para> |
| 138 | |
| 139 | </sect1> |
| 140 | |
| 141 | |
| 142 | |
| 143 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 144 | <sect1 id="hg-manual.lock-orders" xreflabel="Lock Orders"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 145 | <title>Detected errors: Inconsistent Lock Orderings</title> |
| 146 | |
| 147 | <para>In this section, and in general, to "acquire" a lock simply |
| 148 | means to lock that lock, and to "release" a lock means to unlock |
| 149 | it.</para> |
| 150 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 151 | <para>Helgrind monitors the order in which threads acquire locks. |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 152 | This allows it to detect potential deadlocks which could arise from |
| 153 | the formation of cycles of locks. Detecting such inconsistencies is |
| 154 | useful because, whilst actual deadlocks are fairly obvious, potential |
| 155 | deadlocks may never be discovered during testing and could later lead |
| 156 | to hard-to-diagnose in-service failures.</para> |
| 157 | |
| 158 | <para>The simplest example of such a problem is as |
| 159 | follows.</para> |
| 160 | |
| 161 | <itemizedlist> |
| 162 | <listitem><para>Imagine some shared resource R, which, for whatever |
| 163 | reason, is guarded by two locks, L1 and L2, which must both be held |
| 164 | when R is accessed.</para> |
| 165 | </listitem> |
| 166 | <listitem><para>Suppose a thread acquires L1, then L2, and proceeds |
| 167 | to access R. The implication of this is that all threads in the |
| 168 | program must acquire the two locks in the order first L1 then L2. |
| 169 | Not doing so risks deadlock.</para> |
| 170 | </listitem> |
| 171 | <listitem><para>The deadlock could happen if two threads -- call them |
| 172 | T1 and T2 -- both want to access R. Suppose T1 acquires L1 first, |
| 173 | and T2 acquires L2 first. Then T1 tries to acquire L2, and T2 tries |
| 174 | to acquire L1, but those locks are both already held. So T1 and T2 |
| 175 | become deadlocked.</para> |
| 176 | </listitem> |
| 177 | </itemizedlist> |
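
<para>The scenario above can be sketched in C roughly as follows. The
function names are hypothetical. Both acquisition orders are exercised
by a single thread, so the program cannot actually deadlock, but the
two orders are still inconsistent with each other.</para>

<programlisting><![CDATA[
```c
#include <pthread.h>

static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t L2 = PTHREAD_MUTEX_INITIALIZER;

/* Acquires L1 then L2, establishing the order "L1 before L2". */
static void take_in_order(void) {
    pthread_mutex_lock(&L1);
    pthread_mutex_lock(&L2);
    /* ... access the shared resource R here ... */
    pthread_mutex_unlock(&L2);
    pthread_mutex_unlock(&L1);
}

/* Acquires L2 then L1, contradicting the order established above. */
static void take_in_reverse(void) {
    pthread_mutex_lock(&L2);
    pthread_mutex_lock(&L1);
    /* ... access the shared resource R here ... */
    pthread_mutex_unlock(&L1);
    pthread_mutex_unlock(&L2);
}

/* Exercises both orders from one thread and returns 0.  If two
   threads ran these functions concurrently, they could deadlock. */
int inconsistent_order_demo(void) {
    take_in_order();
    take_in_reverse();
    return 0;
}
```
]]></programlisting>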
| 178 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 179 | <para>Helgrind builds a directed graph indicating the order in which |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 180 | locks have been acquired in the past. When a thread acquires a new |
| 181 | lock, the graph is updated, and then checked to see if it now contains |
| 182 | a cycle. The presence of a cycle indicates a potential deadlock involving |
| 183 | the locks in the cycle.</para> |
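
<para>A much simplified sketch of such a graph is shown below. It is
not Helgrind's implementation: the fixed-size adjacency matrix, the
small integer lock identifiers and all function names are assumptions
made for illustration only.</para>

<programlisting><![CDATA[
```c
#include <stdbool.h>
#include <string.h>

#define MAX_LOCKS 8   /* arbitrary limit for this sketch */

/* edge[a][b] is true if some thread acquired lock b while holding
   lock a.  Lock identifiers must lie in [0, MAX_LOCKS). */
static bool edge[MAX_LOCKS][MAX_LOCKS];

/* Clears all recorded acquisition orders. */
void reset_graph(void) {
    memset(edge, 0, sizeof edge);
}

/* Depth-first search: can 'to' be reached from 'from' by following
   recorded edges? */
static bool reachable(int from, int to, bool *seen) {
    if (from == to)
        return true;
    seen[from] = true;
    for (int next = 0; next < MAX_LOCKS; next++) {
        if (edge[from][next] && !seen[next] && reachable(next, to, seen))
            return true;
    }
    return false;
}

/* Records that 'acquired' was taken while 'held' was already held.
   Returns true if the new edge closes a cycle, that is, if 'held'
   was already reachable from 'acquired'. */
bool record_acquisition(int held, int acquired) {
    bool seen[MAX_LOCKS] = { false };
    bool cycle = reachable(acquired, held, seen);
    edge[held][acquired] = true;
    return cycle;
}
```
]]></programlisting>

<para>For instance, after recording "0 before 1" and "1 before 2", an
acquisition of lock 0 while holding lock 2 closes a cycle of three
locks.</para>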
| 184 | |
| 185 | <para>In simple situations, where the cycle only contains two locks, |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 186 | Helgrind will show where the required order was established:</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 187 | |
| 188 | <programlisting><![CDATA[ |
| 189 | Thread #1: lock order "0x7FEFFFAB0 before 0x7FEFFFA80" violated |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 190 | at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 191 | by 0x40081F: main (tc13_laog1.c:24) |
| 192 | Required order was established by acquisition of lock at 0x7FEFFFAB0 |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 193 | at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 194 | by 0x400748: main (tc13_laog1.c:17) |
| 195 | followed by a later acquisition of lock at 0x7FEFFFA80 |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 196 | at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 197 | by 0x400773: main (tc13_laog1.c:18) |
| 198 | ]]></programlisting> |
| 199 | |
| 200 | <para>When there are more than two locks in the cycle, the error is |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 201 | equally serious. However, at present Helgrind does not show the locks |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 202 | involved, so as to avoid flooding you with information. That could be |
| 203 | fixed in future. Here is an example involving a cycle |
| 204 | of five locks from a naive implementation of the famous Dining |
| 205 | Philosophers problem |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 206 | (see <computeroutput>helgrind/tests/tc14_laog_dinphils.c</computeroutput>). |
| 207 | In this case Helgrind has detected that all 5 philosophers could |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 208 | simultaneously pick up their left fork and then deadlock whilst |
| 209 | waiting to pick up their right forks.</para> |
| 210 | |
| 211 | <programlisting><![CDATA[ |
| 212 | Thread #6: lock order "0x6010C0 before 0x601160" violated |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 213 | at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 214 | by 0x4007C0: dine (tc14_laog_dinphils.c:19) |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 215 | by 0x4C25DF7: mythread_wrapper (hg_intercepts.c:178) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 216 | by 0x4E2F09D: start_thread (in /lib64/libpthread-2.5.so) |
| 217 | by 0x51054CC: clone (in /lib64/libc-2.5.so) |
| 218 | ]]></programlisting> |
| 219 | |
| 220 | </sect1> |
| 221 | |
| 222 | |
| 223 | |
| 224 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 225 | <sect1 id="hg-manual.data-races" xreflabel="Data Races"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 226 | <title>Detected errors: Data Races</title> |
| 227 | |
| 228 | <para>A data race happens, or could happen, when two threads |
| 229 | access a shared memory location without using suitable locks to |
| 230 | ensure single-threaded access. Such missing locking can cause |
| 231 | obscure timing-dependent bugs. Ensuring programs are race-free is |
| 232 | one of the central difficulties of threaded programming.</para> |
| 233 | |
| 234 | <para>Reliably detecting races is a difficult problem, and most |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 235 | of Helgrind's internals are devoted to dealing with it. |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 236 | As a consequence this section is somewhat long and involved. |
| 237 | We begin with a simple example.</para> |
| 238 | |
| 239 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 240 | <sect2 id="hg-manual.data-races.example" xreflabel="Simple Race"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 241 | <title>A Simple Data Race</title> |
| 242 | |
| 243 | <para>About the simplest possible example of a race is as follows. In |
| 244 | this program, it is impossible to know what the value |
| 245 | of <computeroutput>var</computeroutput> is at the end of the program. |
| 246 | Is it 2? Or 1?</para> |
| 247 | |
| 248 | <programlisting><![CDATA[ |
| 249 | #include <pthread.h> |
| 250 | |
| 251 | int var = 0; |
| 252 | |
| 253 | void* child_fn ( void* arg ) { |
| 254 | var++; /* Unprotected relative to parent */ /* this is line 6 */ |
| 255 | return NULL; |
| 256 | } |
| 257 | |
| 258 | int main ( void ) { |
| 259 | pthread_t child; |
| 260 | pthread_create(&child, NULL, child_fn, NULL); |
| 261 | var++; /* Unprotected relative to child */ /* this is line 13 */ |
| 262 | pthread_join(child, NULL); |
| 263 | return 0; |
| 264 | } |
| 265 | ]]></programlisting> |
| 266 | |
| 267 | <para>The problem is that there is nothing to |
| 268 | stop <computeroutput>var</computeroutput> being updated simultaneously |
| 269 | by both threads. A correct program would |
| 270 | protect <computeroutput>var</computeroutput> with a lock of type |
| 271 | <computeroutput>pthread_mutex_t</computeroutput>, which is acquired |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 272 | before each access and released afterwards. Helgrind's output for |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 273 | this program is:</para> |
| 274 | |
| 275 | <programlisting><![CDATA[ |
| 276 | Thread #1 is the program's root thread |
| 277 | |
| 278 | Thread #2 was created |
| 279 | at 0x510548E: clone (in /lib64/libc-2.5.so) |
| 280 | by 0x4E2F305: do_clone (in /lib64/libpthread-2.5.so) |
| 281 | by 0x4E2F7C5: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.5.so) |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 282 | by 0x4C23870: pthread_create@* (hg_intercepts.c:198) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 283 | by 0x4005F1: main (simple_race.c:12) |
| 284 | |
| 285 | Possible data race during write of size 4 at 0x601034 |
| 286 | at 0x4005F2: main (simple_race.c:13) |
| 287 | Old state: shared-readonly by threads #1, #2 |
| 288 | New state: shared-modified by threads #1, #2 |
| 289 | Reason: this thread, #1, holds no consistent locks |
| 290 | Location 0x601034 has never been protected by any lock |
| 291 | ]]></programlisting> |
| 292 | |
| 293 | <para>This is quite a lot of detail for an apparently simple error. |
| 294 | The last clause is the main error message. It says there is a race as |
| 295 | a result of a write of size 4 (bytes), at 0x601034, which is |
| 296 | presumably the address of <computeroutput>var</computeroutput>, |
| 297 | happening in function <computeroutput>main</computeroutput> at line 13 |
| 298 | in the program.</para> |
| 299 | |
| 300 | <para>Note that it is purely by chance that the race is |
| 301 | reported for the parent thread's access. It could equally have been |
| 302 | reported instead for the child's access, at line 6. The error will |
| 303 | only be reported for one of the locations, since neither the parent |
| 304 | nor child is, by itself, incorrect. It is only when both access |
| 305 | <computeroutput>var</computeroutput> without a lock that an error |
| 306 | exists.</para> |
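
<para>For comparison, a corrected version of the program might look
like the sketch below, in which both increments are guarded by the same
mutex. The names <computeroutput>var_lock</computeroutput> and
<computeroutput>run_locked_increments</computeroutput> are hypothetical
and do not appear in the original test program.</para>

<programlisting><![CDATA[
```c
#include <pthread.h>

static int var = 0;
static pthread_mutex_t var_lock = PTHREAD_MUTEX_INITIALIZER;

static void *child_fn(void *arg) {
    (void)arg;
    pthread_mutex_lock(&var_lock);
    var++;                          /* protected relative to parent */
    pthread_mutex_unlock(&var_lock);
    return NULL;
}

/* Runs the parent and child increments under the lock.  Returns the
   final value of var, or -1 if the child could not be created. */
int run_locked_increments(void) {
    pthread_t child;

    var = 0;                        /* no other thread exists yet */
    if (pthread_create(&child, NULL, child_fn, NULL) != 0)
        return -1;

    pthread_mutex_lock(&var_lock);
    var++;                          /* protected relative to child */
    pthread_mutex_unlock(&var_lock);

    pthread_join(child, NULL);
    return var;
}
```
]]></programlisting>

<para>Because every access made while both threads exist holds
<computeroutput>var_lock</computeroutput>, the final value is always
2.</para>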
| 307 | |
| 308 | <para>The error message shows some other interesting details. The |
| 309 | sections below explain them. Here we merely note their presence:</para> |
| 310 | |
| 311 | <itemizedlist> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 312 | <listitem><para>Helgrind maintains some kind of state machine for the |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 313 | memory location in question, hence the "<computeroutput>Old |
| 314 | state:</computeroutput>" and "<computeroutput>New |
| 315 | state:</computeroutput>" lines.</para> |
| 316 | </listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 317 | <listitem><para>Helgrind keeps track of which threads have accessed |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 318 | the location: "<computeroutput>threads #1, #2</computeroutput>". |
| 319 | Before printing the main error message, it prints the creation |
| 320 | points of these two threads, so you can see which threads it is |
| 321 | referring to.</para> |
| 322 | </listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 323 | <listitem><para>Helgrind tries to provide an explanation of why the |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 324 | race exists: "<computeroutput>Location 0x601034 has never been |
| 325 | protected by any lock</computeroutput>".</para> |
| 326 | </listitem> |
| 327 | </itemizedlist> |
| 328 | |
| 329 | <para>Understanding the memory state machine is central to |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 330 | understanding Helgrind's race-detection algorithm. The next three |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 331 | subsections explain this.</para> |
| 332 | |
| 333 | </sect2> |
| 334 | |
| 335 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 336 | <sect2 id="hg-manual.data-races.memstates" xreflabel="Memory States"> |
| 337 | <title>Helgrind's Memory State Machine</title> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 338 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 339 | <para>Helgrind tracks the state of every byte of memory used by your |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 340 | program. There are a number of states, but only three are |
| 341 | interesting:</para> |
| 342 | |
| 343 | <itemizedlist> |
| 344 | <listitem><para>Exclusive: memory in this state is regarded as owned |
| 345 | exclusively by one particular thread. That thread may read and |
| 346 | write it without a lock. Even in highly threaded programs, the |
| 347 | majority of locations never leave the Exclusive state, since most |
| 348 | data is thread-private.</para> |
| 349 | </listitem> |
| 350 | <listitem><para>Shared-Readonly: memory in this state is regarded as |
| 351 | shared by multiple threads. In this state, any thread may read the |
| 352 | memory without a lock, reflecting the fact that readonly data may |
| 353 | safely be shared between threads without locking.</para> |
| 354 | </listitem> |
| 355 | <listitem><para>Shared-Modified: memory in this state is regarded as |
| 356 | shared by multiple threads, at least one of which has written to it. |
| 357 | All participating threads must hold at least one lock in common when |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 358 | accessing the memory. If no such lock exists, Helgrind reports a |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 359 | race error.</para> |
| 360 | </listitem> |
| 361 | </itemizedlist> |
| 362 | |
| 363 | <para>Let's review the simple example above with this in mind. When |
| 364 | the program starts, <computeroutput>var</computeroutput> is not in any |
| 365 | of these states. Either the parent or child thread gets to its |
| 366 | <computeroutput>var++</computeroutput> first, and thereby |
| 367 | gets Exclusive ownership of the location.</para> |
| 368 | |
| 369 | <para>The later-running thread now arrives at |
| 370 | its <computeroutput>var++</computeroutput> statement. It first reads |
| 371 | the existing value from memory. |
| 372 | Because <computeroutput>var</computeroutput> is currently marked as |
| 373 | owned exclusively by the other thread, its state is changed to |
| 374 | shared-readonly by both threads.</para> |
| 375 | |
| 376 | <para>This same thread adds one to the value it has and stores it back |
| 377 | in <computeroutput>var</computeroutput>. This causes another state |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 378 | change, this time to the shared-modified state. Because Helgrind has |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 379 | also been tracking which threads hold which locks, it can see that |
| 380 | <computeroutput>var</computeroutput> is in shared-modified state but |
| 381 | no lock has been used to consistently protect it. Hence a race is |
| 382 | reported exactly at the transition from shared-readonly to |
| 383 | shared-modified.</para> |
| 384 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 385 | <para>The essence of the algorithm is this. Helgrind keeps track of |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 386 | each memory location that has been accessed by more than one thread. |
| 387 | For each such location it incrementally infers the set of locks which |
| 388 | have consistently been used to protect that location. If the |
| 389 | location's lockset becomes empty, and at some point one of the threads |
| 390 | attempts to write to it, a race is then reported.</para> |
| 391 | |
| 392 | <para>This technique is known as "lockset inference" and was |
| 393 | introduced in: "Eraser: A Dynamic Data Race Detector for Multithreaded |
| 394 | Programs" (Stefan Savage, Michael Burrows, Greg Nelson, Patrick |
| 395 | Sobalvarro and Thomas Anderson, ACM Transactions on Computer Systems, |
| 396 | 15(4):391-411, November 1997).</para> |
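
<para>The basic, unrefined form of the algorithm described so far can
be sketched as a per-location state machine. This is a simplification
for illustration, not Helgrind's actual code: the
<computeroutput>Shadow</computeroutput> type, the representation of a
lockset as a 32-bit mask and the function names are all
assumptions.</para>

<programlisting><![CDATA[
```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { ST_NEW, ST_EXCLUSIVE, ST_SHARED_RO, ST_SHARED_MOD } MemState;

/* Hypothetical per-location shadow state. */
typedef struct {
    MemState state;
    int      owner;     /* owning thread while Exclusive */
    uint32_t lockset;   /* candidate guarding locks, one bit per lock */
} Shadow;

void shadow_init(Shadow *s) {
    s->state = ST_NEW;
    s->owner = -1;
    s->lockset = ~(uint32_t)0;   /* initially: every lock is a candidate */
}

/* Records one access by 'thread', which currently holds the locks in
   'held'.  Returns true if this access would be reported as a race. */
bool shadow_access(Shadow *s, int thread, uint32_t held, bool is_write) {
    switch (s->state) {
    case ST_NEW:
        s->state = ST_EXCLUSIVE;
        s->owner = thread;
        return false;
    case ST_EXCLUSIVE:
        if (thread == s->owner)
            return false;            /* owner needs no lock */
        s->state = is_write ? ST_SHARED_MOD : ST_SHARED_RO;
        s->lockset &= held;
        break;
    case ST_SHARED_RO:
        s->lockset &= held;
        if (is_write)
            s->state = ST_SHARED_MOD;
        break;
    case ST_SHARED_MOD:
        s->lockset &= held;
        break;
    }
    /* A race exists once the location has been written while shared
       and no single lock has guarded every shared access. */
    return s->state == ST_SHARED_MOD && s->lockset == 0;
}
```
]]></programlisting>

<para>Replaying the simple example, a read and a write by thread 1
followed by a read and a write by thread 2, all with no locks held,
yields a report only on the final write, at the transition from
shared-readonly to shared-modified.</para>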
| 397 | |
| 398 | <para>Lockset inference has since been widely implemented, studied and |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 399 | extended. Helgrind incorporates several refinements aimed at avoiding |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 400 | the high false error rate that naive versions of the algorithm suffer |
| 401 | from. A |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 402 | <link linkend="hg-manual.data-races.summary">summary of the complete |
| 403 | algorithm used by Helgrind</link> is presented below. First, however, |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 404 | it is important to understand details of transitions pertaining to the |
| 405 | Exclusive-ownership state.</para> |
| 406 | |
| 407 | </sect2> |
| 408 | |
| 409 | |
| 410 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 411 | <sect2 id="hg-manual.data-races.exclusive" xreflabel="Excl Transfers"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 412 | <title>Transfers of Exclusive Ownership Between Threads</title> |
| 413 | |
| 414 | <para>As presented, the algorithm is far too strict. It reports many |
| 415 | errors in perfectly correct, widely used parallel programming |
| 416 | constructions, for example, using child worker threads and worker |
| 417 | thread pools.</para> |
| 418 | |
| 419 | <para>To avoid these false errors, we must refine the algorithm so |
| 420 | that it keeps memory in an Exclusive ownership state in cases where it |
| 421 | would otherwise decay into a shared-readonly or shared-modified state. |
| 422 | Recall that Exclusive ownership is special in that it grants the |
| 423 | owning thread the right to access memory without use of any locks. In |
| 424 | order to support worker-thread and worker-thread-pool idioms, we will |
| 425 | allow threads to steal exclusive ownership of memory from other |
| 426 | threads under certain circumstances.</para> |
| 427 | |
| 428 | <para>Here's an example. Imagine a parent thread creates child |
| 429 | threads to do units of work. For each unit of work, the parent |
| 430 | allocates a work buffer, fills it in, and creates the child thread, |
| 431 | handing it a pointer to the buffer. The child reads/writes the buffer |
| 432 | and eventually exits, and the waiting parent then extracts the results |
| 433 | from the buffer:</para> |
| 434 | |
| 435 | <programlisting><![CDATA[ |
| 436 | typedef ... Buffer; |
| 437 | |
| 438 | pthread_t child; |
| 439 | Buffer buf; |
| 440 | |
| 441 | /* ---- Parent ---- */ /* ---- Child ---- */ |
| 442 | |
| 443 | /* parent writes workload into buf */ |
| 444 | pthread_create( &child, child_fn, &buf ); |
| 445 | |
| 446 | /* parent does not read */ void child_fn ( Buffer* buf ) { |
| 447 | /* or write buf */ /* read/write buf */ |
| 448 | } |
| 449 | |
| 450 | pthread_join ( child ); |
| 451 | /* parent reads results from buf */ |
| 452 | ]]></programlisting> |
| 453 | |
| 454 | <para>Although <computeroutput>buf</computeroutput> is accessed by |
| 455 | both threads, neither uses locks, yet the program is race-free. The |
| 456 | essential observation is that the child's creation and exit create |
| 457 | synchronisation events between it and the parent. These force the |
| 458 | child's accesses to <computeroutput>buf</computeroutput> to happen |
| 459 | after the parent initialises <computeroutput>buf</computeroutput>, and |
| 460 | before the parent reads the results |
| 461 | from <computeroutput>buf</computeroutput>.</para> |
| 462 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 463 | <para>To model this, Helgrind allows the child to steal, from the |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 464 | parent, exclusive ownership of any memory exclusively owned by the |
| 465 | parent before the pthread_create call. Similarly, once the parent's |
| 466 | pthread_join call returns, it can steal back ownership of memory |
| 467 | exclusively owned by the child. In this way ownership |
| 468 | of <computeroutput>buf</computeroutput> is transferred from parent to |
| 469 | child and back, so the basic algorithm does not report any races |
| 470 | despite the absence of any locking.</para> |
| 471 | |
| 472 | <para>Note that the child may only steal memory owned by the parent |
| 473 | prior to the pthread_create call. If the child attempts to read or |
| 474 | write memory which is also accessed by the parent in between the |
| 475 | pthread_create and pthread_join calls, an error is still |
| 476 | reported.</para> |
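
<para>The schematic listing above omits the attribute argument of
<computeroutput>pthread_create</computeroutput> and the result argument
of <computeroutput>pthread_join</computeroutput>. A compilable
rendering of the same idiom might look as follows. The layout of
<computeroutput>Buffer</computeroutput>, the doubling "work" and the
function name <computeroutput>run_worker</computeroutput> are
assumptions made for illustration.</para>

<programlisting><![CDATA[
```c
#include <pthread.h>

/* Hypothetical buffer layout; the original leaves it unspecified. */
typedef struct {
    int input;
    int result;
} Buffer;

static void *child_fn(void *arg) {
    Buffer *buf = arg;
    buf->result = buf->input * 2;   /* child reads and writes buf */
    return NULL;
}

/* Hands one unit of work to a child thread and returns its result,
   or -1 if the child could not be created. */
int run_worker(int input) {
    pthread_t child;
    Buffer buf;

    buf.input = input;              /* parent writes workload into buf */
    buf.result = 0;
    if (pthread_create(&child, NULL, child_fn, &buf) != 0)
        return -1;

    /* parent neither reads nor writes buf while the child runs */

    pthread_join(child, NULL);
    return buf.result;              /* parent reads results from buf */
}
```
]]></programlisting>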
| 477 | |
| 478 | <para>This technique was introduced with the name "thread lifetime |
| 479 | segments" in "Runtime Checking of Multithreaded Applications with |
| 480 | Visual Threads" (Jerry J. Harrow, Jr, Proceedings of the 7th |
| 481 | International SPIN Workshop on Model Checking of Software Stanford, |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 482 | California, USA, August 2000, LNCS 1885, pp331--342). Helgrind |
| 483 | implements an extended version of it. Specifically, Helgrind allows |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 484 | transfer of exclusive ownership in the following situations:</para> |
| 485 | |
| 486 | <itemizedlist> |
| 487 | <listitem><para>At thread creation: a child can acquire ownership of |
| 488 | memory held exclusively by the parent prior to the child's |
| 489 | creation.</para> |
| 490 | </listitem> |
| 491 | <listitem><para>At thread joining: the joiner (thread not exiting) |
| 492 | can acquire ownership of memory held exclusively by the joinee |
| 493 | (thread that is exiting) at the point it exited.</para> |
| 494 | </listitem> |
| 495 | <listitem><para>At condition variable signallings and broadcasts. A |
| 496 | thread Tw which completes a pthread_cond_wait call as a result of |
| 497 | a signal or broadcast on the same condition variable by some other |
| 498 | thread Ts, may acquire ownership of memory held exclusively by |
| 499 | Ts prior to the pthread_cond_signal/broadcast |
| 500 | call.</para> |
| 501 | </listitem> |
| 502 | <listitem><para>At semaphore post (sem_post) calls. A thread Tw |
| 503 | which completes a sem_wait call as a result of a sem_post call |
| 504 | on the same semaphore by some other thread Tp, may acquire |
| 505 | ownership of memory held exclusively by Tp prior to the sem_post |
| 506 | call.</para> |
| 507 | </listitem> |
| 508 | </itemizedlist> |
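
<para>The semaphore case in the list above can be sketched as follows.
The poster thread writes <computeroutput>payload</computeroutput>
before calling <computeroutput>sem_post</computeroutput>, and the
waiter reads it only after <computeroutput>sem_wait</computeroutput>
returns, so no lock is needed. All names are hypothetical, and the
sketch assumes a platform on which unnamed POSIX semaphores
(<computeroutput>sem_init</computeroutput>) are available.</para>

<programlisting><![CDATA[
```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

static int payload;
static sem_t ready;

/* Plays the role of thread Tp: writes, then posts. */
static void *poster_fn(void *arg) {
    (void)arg;
    payload = 123;                  /* written before the post */
    sem_post(&ready);
    return NULL;
}

/* Plays the role of thread Tw.  Returns the value seen in payload
   after the wait completes, or -1 on setup failure. */
int run_semaphore_handoff(void) {
    pthread_t poster;
    int seen;

    payload = 0;
    if (sem_init(&ready, 0, 0) != 0)
        return -1;
    if (pthread_create(&poster, NULL, poster_fn, NULL) != 0) {
        sem_destroy(&ready);
        return -1;
    }

    /* Retry if the wait is interrupted by a signal. */
    while (sem_wait(&ready) == -1 && errno == EINTR)
        continue;
    seen = payload;                 /* read after the wait completes */

    pthread_join(poster, NULL);
    sem_destroy(&ready);
    return seen;
}
```
]]></programlisting>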
| 509 | |
| 510 | </sect2> |
| 511 | |
| 512 | |
| 513 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 514 | <sect2 id="hg-manual.data-races.re-excl" xreflabel="Re-Excl Transfers"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 515 | <title>Restoration of Exclusive Ownership</title> |
| 516 | |
| 517 | <para>Another common idiom is to partition the lifetime of the program |
| 518 | as a whole into several distinct phases. In some of those phases, a |
| 519 | memory location may be accessed by multiple threads and so require |
| 520 | locking. In other phases only one thread exists and so can access the |
| 521 | memory without locking. For example:</para> |
| 522 | |
| 523 | <programlisting><![CDATA[ |
| 524 | int var = 0; /* shared variable */ |
| 525 | pthread_mutex_t mx = PTHREAD_MUTEX_INITIALIZER; /* guard for var */ |
| 526 | pthread_t child; |
| 527 | |
| 528 | /* ---- Parent ---- */ /* ---- Child ---- */ |
| 529 | |
| 530 | var += 1; /* no lock used */ |
| 531 | |
| 532 | pthread_create( &child, NULL, child_fn, NULL ); |
| 533 | |
| 534 | void* child_fn ( void* uu ) { |
| 535 | pthread_mutex_lock(&mx); pthread_mutex_lock(&mx); |
| 536 | var += 2; var += 3; |
| 537 | pthread_mutex_unlock(&mx); pthread_mutex_unlock(&mx); |
| 538 | return NULL; } |
| 539 | |
| 540 | pthread_join ( child, NULL ); |
| 541 | |
| 542 | var += 4; /* no lock used */ |
| 543 | ]]></programlisting> |
| 544 | |
| 545 | <para>This program is correct, but using only the mechanisms described |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 546 | so far, Helgrind would report an error at |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 547 | <computeroutput>var += 4</computeroutput>. This is because, by that |
| 548 | point, <computeroutput>var</computeroutput> is marked as being in the |
| 549 | state "shared-modified and protected by the |
| 550 | lock <computeroutput>mx</computeroutput>", but is being accessed |
| 551 | without locking. Really, what we want is |
| 552 | for <computeroutput>var</computeroutput> to return to the parent |
| 553 | thread's exclusive ownership after the child thread has exited.</para> |
| 554 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 555 | <para>To make this possible, for every memory location Helgrind also keeps |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 556 | track of all the threads that have accessed that location |
| 557 | -- its threadset. When a thread Tquitter joins back to Tstayer, |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 558 | Helgrind examines the threadsets of all memory in shared-modified or |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 559 | shared-readonly state. In each such threadset, if Tquitter is |
| 560 | mentioned, it is removed and replaced by Tstayer. If, as a result, a |
| 561 | threadset becomes a singleton set containing Tstayer, then the |
| 562 | location's state is changed to belongs-exclusively-to-Tstayer.</para> |
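<para>The threadset update at a join can be modelled in a few lines of
C. This is only an illustrative sketch, not Helgrind's implementation:
the bitmask representation, the <computeroutput>loc_state</computeroutput>
structure and the <computeroutput>on_join</computeroutput> function are
all assumptions made for the example, and thread numbers are assumed to
be below 32.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: bit i set means thread #i has accessed the location. */
typedef uint32_t threadset_t;

enum loc_kind { LOC_EXCLUSIVE, LOC_SHARED };

struct loc_state {
    enum loc_kind kind;
    threadset_t   threads;  /* accessing threads, used while LOC_SHARED */
    unsigned      owner;    /* owning thread, used while LOC_EXCLUSIVE  */
};

/* Apply the effect of "quitter joins back to stayer" to one location. */
static void on_join(struct loc_state *loc, unsigned quitter, unsigned stayer)
{
    threadset_t q = (threadset_t)1 << quitter;
    threadset_t s = (threadset_t)1 << stayer;

    if (loc->kind == LOC_EXCLUSIVE) {
        /* Memory owned by the quitter is inherited by the stayer. */
        if (loc->owner == quitter)
            loc->owner = stayer;
        return;
    }

    if (loc->threads & q) {
        /* Replace the quitter by the stayer in the threadset. */
        loc->threads = (loc->threads & ~q) | s;
        if (loc->threads == s) {
            /* Singleton set: the location reverts to exclusive ownership. */
            loc->kind  = LOC_EXCLUSIVE;
            loc->owner = stayer;
        }
    }
}
```
]]></programlisting>

<para>Applied to the example above, a location shared by the parent and
the child has a two-element threadset, which collapses to the parent
alone once the child has been joined.</para>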
| 563 | |
| 564 | <para>In our example, the result is exactly as we desire: |
| 565 | <computeroutput>var</computeroutput> is reacquired exclusively by the |
| 566 | parent after the child exits.</para> |
| 567 | |
| 568 | <para>More generally, when a group of threads merges back to a single |
| 569 | thread via a cascade of pthread_join calls, any memory shared by the |
| 570 | group (or a subset of it) ends up being owned exclusively by the sole |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 571 | surviving thread. This significantly enhances Helgrind's flexibility, |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 572 | since it means that each memory location may make arbitrarily many |
| 573 | transitions between exclusive and shared ownership. Furthermore, a |
| 574 | different lock may protect the location during each period of shared |
| 575 | ownership.</para> |
| 576 | |
| 577 | </sect2> |
| 578 | |
| 579 | |
| 580 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 581 | <sect2 id="hg-manual.data-races.summary" xreflabel="Race Det Summary"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 582 | <title>A Summary of the Race Detection Algorithm</title> |
| 583 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 584 | <para>Helgrind looks for memory locations which are accessed by more |
| 585 | than one thread. For each such location, Helgrind records which of |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 586 | the program's locks were held by the accessing thread at the time of |
| 587 | each access. The hope is to discover that there is indeed at least |
| 588 | one lock which is consistently used by all threads to protect that |
| 589 | location. If no such lock can be found, then there is apparently no |
| 590 | consistent locking strategy being applied for that location, and so a |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 591 | data race might result. Helgrind accordingly reports an |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 592 | error.</para> |
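<para>The basic idea amounts to intersecting, for each shared location,
the sets of locks held at every access. The following sketch shows that
intersection under simplifying assumptions: locks are numbered and
stored in a bitmask, the location is already known to be shared, and
the names <computeroutput>shared_loc</computeroutput> and
<computeroutput>record_access</computeroutput> are invented for the
example rather than taken from Helgrind.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t lockset_t;   /* bit i set: hypothetical lock #i is held */

struct shared_loc {
    bool      seen;           /* has any access been recorded yet?       */
    lockset_t candidates;     /* locks held during every access so far   */
};

/* Record one access made while holding 'held'.  Returns true when no
   lock has been held consistently, i.e. a possible race to report. */
static bool record_access(struct shared_loc *loc, lockset_t held)
{
    if (!loc->seen) {
        loc->seen = true;
        loc->candidates = held;
    } else {
        loc->candidates &= held;   /* keep only the locks common to all */
    }
    return loc->candidates == 0;
}
```
]]></programlisting>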
| 593 | |
| 594 | <para>In practice this discipline is far too simplistic, and is |
| 595 | unusable since it reports many races in some widely used and |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 596 | known-correct programming disciplines. Helgrind's checking therefore |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 597 | incorporates many refinements to this basic idea, and can be |
| 598 | summarised as follows:</para> |
| 599 | |
| 600 | <para>The following thread events are intercepted and monitored:</para> |
| 601 | |
| 602 | <itemizedlist> |
| 603 | <listitem><para>thread creation and exiting (pthread_create, |
| 604 | pthread_join, pthread_exit)</para> |
| 605 | </listitem> |
| 606 | <listitem> |
| 607 | <para>lock acquisition and release (pthread_mutex_lock, |
| 608 | pthread_mutex_unlock, pthread_rwlock_rdlock, |
| 609 | pthread_rwlock_wrlock, |
| 610 | pthread_rwlock_unlock)</para> |
| 611 | </listitem> |
| 612 | <listitem> |
| 613 | <para>inter-thread event notifications (pthread_cond_wait, |
| 614 | pthread_cond_signal, pthread_cond_broadcast, |
| 615 | sem_wait, sem_post)</para> |
| 616 | </listitem> |
| 617 | </itemizedlist> |
| 618 | |
| 619 | <para>Memory allocation and deallocation events are intercepted and |
| 620 | monitored:</para> |
| 621 | |
| 622 | <itemizedlist> |
| 623 | <listitem> |
| 624 | <para>malloc/new/free/delete and variants</para> |
| 625 | </listitem> |
| 626 | <listitem> |
| 627 | <para>stack allocation and deallocation</para> |
| 628 | </listitem> |
| 629 | </itemizedlist> |
| 630 | |
| 631 | <para>All memory accesses are intercepted and monitored.</para> |
| 632 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 633 | <para>By observing the above events, Helgrind can infer certain |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 634 | aspects of the program's locking discipline. Programs which adhere to |
| 635 | the following rules are considered to be acceptable: |
| 636 | </para> |
| 637 | |
| 638 | <itemizedlist> |
| 639 | <listitem> |
| 640 | <para>A thread may allocate memory, and write initial values into |
| 641 | it, without locking. That thread is regarded as owning the memory |
| 642 | exclusively.</para> |
| 643 | </listitem> |
| 644 | <listitem> |
| 645 | <para>A thread may read and write memory which it owns exclusively, |
| 646 | without locking.</para> |
| 647 | </listitem> |
| 648 | <listitem> |
| 649 | <para>Memory which is owned exclusively by one thread may be read by |
| 650 | that thread and others without locking. However, in this situation |
| 651 | no thread may do unlocked writes to the memory (except for the owner |
| 652 | thread's initializing write).</para> |
| 653 | </listitem> |
| 654 | <listitem> |
| 655 | <para>Memory which is shared between multiple threads, one or more |
| 656 | of which writes to it, must be protected by a lock which is |
| 657 | correctly acquired and released by all threads accessing the |
| 658 | memory.</para> |
| 659 | </listitem> |
| 660 | </itemizedlist> |
| 661 | |
| 662 | <para>Any violation of this discipline will cause an error to be reported. |
| 663 | However, two exemptions apply:</para> |
| 664 | |
| 665 | <itemizedlist> |
| 666 | <listitem> |
| 667 | <para>A thread Y can acquire exclusive ownership of memory |
| 668 | previously owned exclusively by a different thread X provided |
| 669 | X's last access and Y's first access are separated by one of the |
| 670 | following synchronization events:</para> |
| 671 | <itemizedlist> |
| 672 | <listitem><para>X creates thread Y</para></listitem> |
| 673 | <listitem><para>X joins back to Y</para></listitem> |
| 674 | <listitem><para>X uses a condition-variable to signal at Y, and Y is |
| 675 | waiting for that event</para></listitem> |
| 676 | <listitem><para>Y completes a semaphore wait as a result of X signalling |
| 677 | on that same semaphore</para></listitem> |
| 678 | </itemizedlist> |
| 679 | <para> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 680 | This refinement allows Helgrind to correctly track the ownership |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 681 | state of inter-thread buffers used in the worker-thread and |
| 682 | worker-thread-pool concurrent programming idioms (styles).</para> |
| 683 | </listitem> |
| 684 | <listitem> |
| 685 | <para>Similarly, if thread Y joins back to thread X, memory |
| 686 | exclusively owned by Y becomes exclusively owned by X instead. |
| 687 | Also, memory that has been shared only by X and Y becomes |
| 688 | exclusively owned by X. More generally, memory that has been shared |
| 689 | by X, Y and some arbitrary other set S of threads is re-marked as |
| 690 | shared by X and S. Hence, under the right circumstances, memory |
| 691 | shared amongst multiple threads, all of which join into just one, |
| 692 | can revert to the exclusive ownership state.</para> |
| 693 | <para> |
| 694 | In effect, each memory location may make arbitrarily many |
| 695 | transitions between exclusive and shared ownership. Furthermore, a |
| 696 | different lock may protect the location during each period of shared |
| 697 | ownership. This significantly enhances the flexibility of the |
| 698 | algorithm.</para> |
| 699 | </listitem> |
| 700 | </itemizedlist> |
| 701 | |
| 702 | <para>The ownership state, accessing thread-set and related lock-set |
| 703 | for each memory location are tracked at 8-bit granularity. This means |
| 704 | the algorithm is precise even for 16- and 8-bit memory |
| 705 | accesses.</para> |
| 706 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 707 | <para>Helgrind correctly handles reader-writer locks in this |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 708 | framework. Locations shared between multiple threads can be protected |
| 709 | during reads by locks held in either read-mode or write-mode, but can |
| 710 | only be protected during writes by locks held in write-mode. Normal |
| 711 | POSIX mutexes are treated as if they are reader-writer locks which are |
| 712 | only ever held in write-mode.</para> |
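<para>The reader-writer rule can be stated as a small predicate. This is
a sketch of the rule as described here, with hypothetical names, and is
not code taken from Helgrind:</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>

enum hold_mode { HELD_READ, HELD_WRITE };

/* Does a lock held in 'mode' count as protecting this access?
   Reads are protected by either mode, writes only by write-mode. */
static bool lock_protects(enum hold_mode mode, bool access_is_write)
{
    return access_is_write ? (mode == HELD_WRITE) : true;
}
```
]]></programlisting>

<para>Under this model an ordinary mutex is always passed as
<computeroutput>HELD_WRITE</computeroutput>, so it protects both reads
and writes.</para>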
| 713 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 714 | <para>Helgrind correctly handles POSIX mutexes for which recursive |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 715 | locking is allowed.</para> |
| 716 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 717 | <para>Helgrind partially correctly handles x86 and amd64 memory access |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 718 | instructions preceded by a LOCK prefix. Writes are correctly handled, |
| 719 | by pretending that the LOCK prefix implies acquisition and release of |
| 720 | a magic "bus hardware lock" mutex before and after the instruction. |
| 721 | This unfortunately requires subsequent reads from such locations to |
| 722 | also use a LOCK prefix, which is not required by the real hardware. |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 723 | Helgrind does not offer any equivalent handling for atomic sequences |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 724 | on PowerPC/POWER platforms created by the use of lwarx/stwcx |
| 725 | instructions.</para> |
| 726 | |
| 727 | </sect2> |
| 728 | |
| 729 | |
| 730 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 731 | <sect2 id="hg-manual.data-races.errmsgs" xreflabel="Race Error Messages"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 732 | <title>Interpreting Race Error Messages</title> |
| 733 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 734 | <para>Helgrind's race detection algorithm collects a lot of |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 735 | information, and tries to present it in a helpful way when a race is |
| 736 | detected. Here's an example:</para> |
| 737 | |
| 738 | <programlisting><![CDATA[ |
| 739 | Thread #2 was created |
| 740 | at 0x510548E: clone (in /lib64/libc-2.5.so) |
| 741 | by 0x4E2F305: do_clone (in /lib64/libpthread-2.5.so) |
| 742 | by 0x4E2F7C5: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.5.so) |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 743 | by 0x4C23870: pthread_create@* (hg_intercepts.c:198) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 744 | by 0x400CEF: main (tc17_sembar.c:195) |
| 745 | |
| 746 | // And the same for threads #3, #4 and #5 -- omitted for conciseness |
| 747 | |
| 748 | Possible data race during read of size 4 at 0x602174 |
| 749 | at 0x400BE5: gomp_barrier_wait (tc17_sembar.c:122) |
| 750 | by 0x400C44: child (tc17_sembar.c:161) |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 751 | by 0x4C25DF7: mythread_wrapper (hg_intercepts.c:178) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 752 | by 0x4E2F09D: start_thread (in /lib64/libpthread-2.5.so) |
| 753 | by 0x51054CC: clone (in /lib64/libc-2.5.so) |
| 754 | Old state: shared-modified by threads #2, #3, #4, #5 |
| 755 | New state: shared-modified by threads #2, #3, #4, #5 |
| 756 | Reason: this thread, #2, holds no consistent locks |
| 757 | Last consistently used lock for 0x602174 was first observed |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 758 | at 0x4C25D01: pthread_mutex_init (hg_intercepts.c:326) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 759 | by 0x4009E4: gomp_barrier_init (tc17_sembar.c:46) |
| 760 | by 0x400CBC: main (tc17_sembar.c:192) |
| 761 | ]]></programlisting> |
| 762 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 763 | <para>Helgrind first announces the creation points of any threads |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 764 | referenced in the error message. This is so it can speak concisely |
| 765 | about threads and sets of threads without repeatedly printing their |
| 766 | creation point call stacks. Each thread is only ever announced once, |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 767 | the first time it appears in any Helgrind error message.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 768 | |
| 769 | <para>The main error message begins at the text |
| 770 | "<computeroutput>Possible data race during read</computeroutput>". |
| 771 | At the start is information you would expect to see -- address and |
| 772 | size of the racing access, whether a read or a write, and the call |
| 773 | stack at the point it was detected.</para> |
| 774 | |
| 775 | <para>More interesting is the state transition caused by this access. |
| 776 | This memory is already in the shared-modified state, and up to now has |
| 777 | been consistently protected by at least one lock. However, the thread |
| 778 | making the access in question (thread #2, here) does not hold any |
| 779 | locks in common with those held during all previous accesses to the |
| 780 | location -- "no consistent locks", in other words.</para> |
| 781 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 782 | <para>Finally, Helgrind shows the lock which has protected this |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 783 | location in all previous accesses. (If there is more than one, only |
| 784 | one is shown.) This can be a useful hint, because it typically shows |
| 785 | the lock that the programmers intended to use to protect the location, |
| 786 | but in this case forgot.</para> |
| 787 | |
| 788 | <para>Here are some more examples of race reports. This is not an |
| 789 | exhaustive list of combinations, but should give you some insight into |
| 790 | how to interpret the output.</para> |
| 791 | |
| 792 | <programlisting><![CDATA[ |
| 793 | Possible data race during write ... |
| 794 | Old state: shared-readonly by threads #1, #2, #3 |
| 795 | New state: shared-modified by threads #1, #2, #3 |
| 796 | Reason: this thread, #3, holds no consistent locks |
| 797 | Location ... has never been protected by any lock |
| 798 | ]]></programlisting> |
| 799 | |
| 800 | <para>The location is shared by 3 threads, all of which have been |
| 801 | reading it without locking ("has never been protected by any lock"). |
| 802 | Now one of them is writing it. Regardless of whether the writer has a |
| 803 | lock or not, this is still an error, because the write races against |
| 804 | the previously observed reads.</para> |
| 805 | |
| 806 | <programlisting><![CDATA[ |
| 807 | Possible data race during read ... |
| 808 | Old state: shared-modified by threads #1, #2, #3 |
| 809 | New state: shared-modified by threads #1, #2, #3 |
| 810 | Reason: this thread, #3, holds no consistent locks |
| 811 | Last consistently used lock for ... was first observed ... |
| 812 | ]]></programlisting> |
| 813 | |
| 814 | <para>The location is shared by 3 threads, all of which have been |
| 815 | reading and writing it while (as required) holding at least one lock |
| 816 | in common. Now it is being read without that lock being held. In the |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 817 | "Last consistently used lock" part, Helgrind offers its best guess as |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 818 | to the identity of the lock that should have been used.</para> |
| 819 | |
| 820 | <programlisting><![CDATA[ |
| 821 | Possible data race during write ... |
| 822 | Old state: owned exclusively by thread #4 |
| 823 | New state: shared-modified by threads #4, #5 |
| 824 | Reason: this thread, #5, holds no locks at all |
| 825 | ]]></programlisting> |
| 826 | |
| 827 | <para>A location that has so far been accessed exclusively by thread |
| 828 | #4 has now been written by thread #5, without use of any lock. This |
| 829 | can be a sign that the programmer did not consider the possibility of |
| 830 | the location being shared between threads, or, alternatively, forgot |
| 831 | to use the appropriate lock.</para> |
| 832 | |
| 833 | <para>Note that thread #4 exclusively owns the location, and so has |
| 834 | the right to access it without holding a lock. However, this message |
| 835 | does not say that thread #4 is not using a lock for this location. |
| 836 | Indeed, it could be using a lock for the location because it intends |
| 837 | to make it available to other threads, one of which is thread #5 -- |
| 838 | and thread #5 has forgotten to use the lock.</para> |
| 839 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 840 | <para>Also, this message implies that Helgrind did not see any |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 841 | synchronisation event between threads #4 and #5 that would have |
| 842 | allowed #5 to acquire exclusive ownership from #4. See |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 843 | <link linkend="hg-manual.data-races.exclusive">above</link> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 844 | for a discussion of transfers of exclusive ownership states between |
| 845 | threads.</para> |
| 846 | |
| 847 | </sect2> |
| 848 | |
| 849 | |
| 850 | </sect1> |
| 851 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 852 | <sect1 id="hg-manual.effective-use" xreflabel="Helgrind Effective Use"> |
| 853 | <title>Hints and Tips for Effective Use of Helgrind</title> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 854 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 855 | <para>Helgrind can be very helpful in finding and resolving |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 856 | threading-related problems. Like all sophisticated tools, it is most |
| 857 | effective when you understand how to play to its strengths.</para> |
| 858 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 859 | <para>Helgrind will be less effective when you merely throw an |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 860 | existing threaded program at it and try to make sense of any reported |
| 861 | errors. It will be more effective if you design threaded programs |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 862 | from the start in a way that helps Helgrind verify correctness. The |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 863 | same is true for finding memory errors with Memcheck, but applies more |
| 864 | here, because thread checking is a harder problem. Consequently it is |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 865 | much easier to write a correct program for which Helgrind falsely |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 866 | reports (threading) errors than it is to write a correct program for |
| 867 | which Memcheck falsely reports (memory) errors.</para> |
| 868 | |
| 869 | <para>With that in mind, here are some tips, listed most important first, |
| 870 | for getting reliable results and avoiding false errors. The first two |
| 871 | are critical. Any violations of them will swamp you with huge numbers |
| 872 | of false data-race errors.</para> |
| 873 | |
| 874 | |
| 875 | <orderedlist> |
| 876 | |
| 877 | <listitem> |
| 878 | <para>Make sure your application, and all the libraries it uses, |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 879 | use the POSIX threading primitives. Helgrind needs to be able to |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 880 | see all events pertaining to thread creation, exit, locking and |
| 881 | other synchronisation events. To do so it intercepts many POSIX |
| 882 | pthread_ functions.</para> |
| 883 | |
| 884 | <para>Do not roll your own threading primitives (mutexes, etc) |
| 885 | from combinations of the Linux futex syscall, counters and wotnot. |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 886 | These throw Helgrind's internal what's-going-on models way off |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 887 | course and will give bogus results.</para> |
| 888 | |
| 889 | <para>Also, do not reimplement existing POSIX abstractions using |
| 890 | other POSIX abstractions. For example, don't build your own |
| 891 | semaphore routines or reader-writer locks from POSIX mutexes and |
| 892 | condition variables. Instead use POSIX reader-writer locks and |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 893 | semaphores directly, since Helgrind supports them directly.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 894 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 895 | <para>Helgrind directly supports the following POSIX threading |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 896 | abstractions: mutexes, reader-writer locks, condition variables |
| 897 | (but see below), and semaphores. Currently spinlocks and barriers |
| 898 | are not supported, although they could be in future. A prototype |
| 899 | "safe" implementation of barriers, based on semaphores, is |
| 900 | available: please contact the Valgrind authors for details.</para> |
| 901 | |
| 902 | <para>At the time of writing, the following popular Linux packages |
| 903 | are known to implement their own threading primitives:</para> |
| 904 | |
| 905 | <itemizedlist> |
| 906 | <listitem><para>Qt version 4.X. Qt 3.X is fine, but not 4.X. |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 907 | Helgrind contains partial direct support for Qt 4.X threading, |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 908 | but this is not yet in a usable state. Assistance from folks |
| 909 | knowledgeable in Qt 4 threading internals would be |
| 910 | appreciated.</para></listitem> |
| 911 | |
| 912 | <listitem><para>Runtime support library for GNU OpenMP (part of |
| 913 | GCC), at least GCC versions 4.2 and 4.3. With some minor effort |
| 914 | of modifying the GNU OpenMP runtime support sources, it is |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 915 | possible to use Helgrind on GNU OpenMP compiled codes. Please |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 916 | contact the Valgrind authors for details.</para></listitem> |
| 917 | </itemizedlist> |
| 918 | </listitem> |
| 919 | |
| 920 | <listitem> |
| 921 | <para>Avoid memory recycling. If you can't avoid it, you must |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 922 | tell Helgrind what is going on via the VALGRIND_HG_CLEAN_MEMORY |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 923 | client request |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 924 | (in <computeroutput>helgrind.h</computeroutput>).</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 925 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 926 | <para>Helgrind is aware of standard memory allocation and |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 927 | deallocation that occurs via malloc/free/new/delete and from entry |
| 928 | and exit of stack frames. In particular, when memory is |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 929 | deallocated via free, delete, or function exit, Helgrind considers |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 930 | that memory clean, so when it is eventually reallocated, its |
| 931 | history is irrelevant.</para> |
| 932 | |
| 933 | <para>However, it is common practice to implement memory recycling |
| 934 | schemes. In these, memory to be freed is not handed to |
| 935 | free/delete, but instead put into a pool of free buffers to be |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 936 | handed out again as required. The problem is that Helgrind has no |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 937 | way to know that such memory is logically no longer in use, and |
| 938 | its history is irrelevant. Hence you must make that explicit, |
| 939 | using the VALGRIND_HG_CLEAN_MEMORY client request to specify the |
| 940 | relevant address ranges. It's easiest to put these requests into |
| 941 | the pool manager code, and use them either when memory is returned |
| 942 | to the pool, or is allocated from it.</para> |
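<para>As a rough illustration, a fixed-size buffer pool might issue the
request when a buffer is returned. Everything about the pool here
(<computeroutput>struct pool</computeroutput>,
<computeroutput>pool_get</computeroutput>,
<computeroutput>pool_put</computeroutput>, the buffer sizes) is
hypothetical. The include path
<computeroutput>valgrind/helgrind.h</computeroutput> is an assumption
about where the header is installed, and a do-nothing fallback is
defined so the sketch still builds without it. The pool itself is not
thread-safe; callers are assumed to hold a lock around it.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

#if defined(__has_include)
#  if __has_include(<valgrind/helgrind.h>)
#    include <valgrind/helgrind.h>
#  endif
#endif
#ifndef VALGRIND_HG_CLEAN_MEMORY
/* Fallback for builds without Valgrind's headers: does nothing. */
#  define VALGRIND_HG_CLEAN_MEMORY(start, len) \
       do { (void)(start); (void)(len); } while (0)
#endif

#define BUF_SIZE  256
#define POOL_BUFS 4

struct pool {
    unsigned char storage[POOL_BUFS][BUF_SIZE];
    void         *free_list[POOL_BUFS];
    size_t        n_free;
};

static void pool_init(struct pool *p)
{
    p->n_free = 0;
    for (size_t i = 0; i < POOL_BUFS; i++)
        p->free_list[p->n_free++] = p->storage[i];
}

/* Hand out a buffer, or NULL if the pool is empty. */
static void *pool_get(struct pool *p)
{
    return p->n_free ? p->free_list[--p->n_free] : NULL;
}

/* Return a buffer previously obtained from pool_get. */
static void pool_put(struct pool *p, void *buf)
{
    /* The buffer is logically freed: ask Helgrind to forget its history. */
    VALGRIND_HG_CLEAN_MEMORY(buf, BUF_SIZE);
    p->free_list[p->n_free++] = buf;
}
```
]]></programlisting>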
| 943 | </listitem> |
| 944 | |
| 945 | <listitem> |
| 946 | <para>Avoid POSIX condition variables. If you can, use POSIX |
| 947 | semaphores (sem_t, sem_post, sem_wait) to do inter-thread event |
| 948 | signalling. Semaphores with an initial value of zero are |
| 949 | particularly useful for this.</para> |
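<para>A minimal sketch of such signalling follows. One thread fills in
a value and posts the semaphore; the other waits on it before reading.
The names <computeroutput>ready</computeroutput>,
<computeroutput>payload</computeroutput>,
<computeroutput>producer</computeroutput> and
<computeroutput>run_handoff</computeroutput> are invented for the
example.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

static sem_t ready;     /* initial value zero: "nothing to consume yet" */
static int   payload;   /* written by the producer, read by the waiter  */

static void *producer(void *unused)
{
    (void)unused;
    payload = 42;       /* no lock: only the producer touches it so far */
    sem_post(&ready);   /* the event that hands the data over           */
    return NULL;
}

/* Returns the value handed over, or -1 on failure. */
static int run_handoff(void)
{
    pthread_t t;
    int rc, seen = -1;

    if (sem_init(&ready, 0, 0) != 0)
        return -1;
    if (pthread_create(&t, NULL, producer, NULL) != 0) {
        sem_destroy(&ready);
        return -1;
    }

    do {                /* retry if interrupted by a signal handler */
        rc = sem_wait(&ready);
    } while (rc != 0 && errno == EINTR);

    if (rc == 0)
        seen = payload; /* safe: ordered after the producer's sem_post */

    pthread_join(t, NULL);
    sem_destroy(&ready);
    return seen;
}
```
]]></programlisting>

<para>Because the semaphore starts at zero, the wait completes only
after the post, regardless of which thread reaches the rendezvous
first.</para>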
| 950 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 951 | <para>Helgrind only partially correctly handles POSIX condition |
| 952 | variables. This is because Helgrind can see inter-thread |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 953 | dependencies between a pthread_cond_wait call and a |
| 954 | pthread_cond_signal/broadcast call only if the waiting thread |
| 955 | actually gets to the rendezvous first (so that it actually calls |
| 956 | pthread_cond_wait). It can't see dependencies between the threads |
| 957 | if the signaller arrives first. In the latter case, POSIX |
| 958 | guidelines imply that the associated boolean condition still |
| 959 | provides an inter-thread synchronisation event, but one which is |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 960 | invisible to Helgrind.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 961 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 962 | <para>The result of Helgrind missing some inter-thread |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 963 | synchronisation events is that it reports false positives. |
| 964 | That's because missing such events reduces the extent to which it |
| 965 | can transfer exclusive memory ownership between threads. So |
| 966 | memory may end up in a shared-modified state when that was not |
| 967 | intended by the application programmers.</para> |
| 968 | |
| 969 | <para>The root cause of this synchronisation lossage is |
| 970 | particularly hard to understand, so an example is helpful. It was |
| 971 | discussed at length by Arndt Muehlenfeld ("Runtime Race Detection |
| 972 | in Multi-Threaded Programs", Dissertation, TU Graz, Austria). The |
| 973 | canonical POSIX-recommended usage scheme for condition variables |
| 974 | is as follows:</para> |
| 975 | |
| 976 | <programlisting><![CDATA[ |
| 977 | b is a Boolean condition, which is False most of the time |
| 978 | cv is a condition variable |
| 979 | mx is its associated mutex |
| 980 | |
| 981 | Signaller: Waiter: |
| 982 | |
| 983 | lock(mx) lock(mx) |
| 984 | b = True while (b == False) |
| 985 | signal(cv) wait(cv,mx) |
| 986 | unlock(mx) unlock(mx) |
| 987 | ]]></programlisting> |
| 988 | |
| 989 | <para>Assume <computeroutput>b</computeroutput> is False most of |
| 990 | the time. If the waiter arrives at the rendezvous first, it |
| 991 | enters its while-loop, waits for the signaller to signal, and |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 992 | eventually proceeds. Helgrind sees the signal, notes the |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 993 | dependency, and all is well.</para> |
| 994 | |
| 995 | <para>If the signaller arrives |
| 996 | first, <computeroutput>b</computeroutput> is set to True, and the |
| 997 | signal disappears into nowhere. When the waiter later arrives, it |
| 998 | does not enter its while-loop and simply carries on. But even in |
| 999 | this case, the waiter code following the while-loop cannot execute |
| 1000 | until the signaller sets <computeroutput>b</computeroutput> to |
| 1001 | True. Hence there is still the same inter-thread dependency, but |
| 1002 | this time it is through an arbitrary in-memory condition, and |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1003 | Helgrind cannot see it.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1004 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1005 | <para>By comparison, Helgrind's detection of inter-thread |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1006 | dependencies caused by semaphore operations is believed to be |
| 1007 | exactly correct.</para> |
| 1008 | |
| 1009 | <para>As far as I know, a solution to this problem that does not |
| 1010 | require source-level annotation of condition-variable wait loops |
| 1011 | is beyond the current state of the art.</para> |
| 1012 | </listitem> |
| 1013 | |
| 1014 | <listitem> |
| 1015 | <para>Make sure you are using a supported Linux distribution. At |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1016 | present, Helgrind only properly supports x86-linux and amd64-linux |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1017 | with glibc-2.3 or later. The latter restriction means we only |
| 1018 | support glibc's NPTL threading implementation. The old |
| 1019 | LinuxThreads implementation is not supported.</para> |
| 1020 | |
| 1021 | <para>Unsupported targets may work to varying degrees. In |
| 1022 | particular ppc32-linux and ppc64-linux running NPTL should work, |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1023 | but you will get false race errors because Helgrind does not know |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1024 | how to properly handle atomic instruction sequences created using |
| 1025 | the lwarx/stwcx instructions.</para> |
| 1026 | </listitem> |
| 1027 | |
| 1028 | <listitem> |
| 1029 | <para>Round up all finished threads using pthread_join. Avoid |
| 1030 | detaching threads: don't create threads in the detached state, and |
| 1031 | don't call pthread_detach on existing threads.</para> |
| 1032 | |
| 1033 | <para>Using pthread_join to round up finished threads provides a |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1034 | clear synchronisation point that both Helgrind and programmers can |
| 1035 | see. This synchronisation point allows Helgrind to adjust its |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1036 | memory ownership |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1037 | models <link linkend="hg-manual.data-races.exclusive">as described |
| 1038 | extensively above</link>, which helps Helgrind produce more |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1039 | accurate error reports.</para> |
| 1040 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1041 | <para>If you don't call pthread_join on a thread, Helgrind has no |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1042 | way to know when it finishes, relative to any significant |
| 1043 | synchronisation points for other threads in the program. So it |
| 1044 | assumes that the thread lingers indefinitely and can potentially |
| 1045 | interfere indefinitely with the memory state of the program. It |
| 1046 | has every right to assume that -- after all, it might really be |
| 1047 | the case that, for scheduling reasons, the exiting thread did run |
| 1048 | very slowly in the last stages of its life.</para> |
| 1049 | </listitem> |
| 1050 | |
| 1051 | <listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1052 | <para>Perform thread debugging (with Helgrind) and memory |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1053 | debugging (with Memcheck) together.</para> |
| 1054 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1055 | <para>Helgrind tracks the state of memory in detail, and memory |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1056 | management bugs in the application are liable to cause confusion. |
| 1057 | In extreme cases, applications which do many invalid reads and |
| 1058 | writes (particularly to freed memory) have been known to crash |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1059 | Helgrind. So, ideally, you should make your application |
| 1060 | Memcheck-clean before using Helgrind.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1061 | |
| 1062 | <para>It may be impossible to make your application Memcheck-clean |
| 1063 | unless you first remove threading bugs. In particular, it may be |
| 1064 | difficult to remove all reads and writes to freed memory in |
| 1065 | multithreaded C++ destructor sequences at program termination. |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1066 | So, ideally, you should make your application Helgrind-clean |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1067 | before using Memcheck.</para> |
| 1068 | |
| 1069 | <para>Since this circularity is obviously unresolvable, at least |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1070 | bear in mind that Memcheck and Helgrind are to some extent |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1071 | complementary, and you may need to use them together.</para> |
| 1072 | </listitem> |
| 1073 | |
| 1074 | <listitem> |
| 1075 | <para>POSIX requires that implementations of standard I/O (printf, |
| 1076 | fprintf, fwrite, fread, etc.) are thread-safe. Unfortunately GNU |
| 1077 | libc implements this by using internal locking primitives that |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1078 | Helgrind is unable to intercept. Consequently Helgrind generates |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1079 | many false race reports when you use these functions.</para> |
| 1080 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1081 | <para>Helgrind attempts to hide these errors using the standard |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1082 | Valgrind error-suppression mechanism. So, at least for simple |
| 1083 | test cases, you don't see any. Nevertheless, some may slip |
| 1084 | through. Just something to be aware of.</para> |
| 1085 | </listitem> |
| 1086 | |
| 1087 | <listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1088 | <para>Helgrind's error checks do not work properly inside the |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1089 | system threading library itself |
| 1090 | (<computeroutput>libpthread.so</computeroutput>), and it usually |
| 1091 | observes large numbers of (false) errors in there. Valgrind's |
| 1092 | suppression system then filters these out, so you should not see |
| 1093 | them.</para> |
| 1094 | |
| 1095 | <para>If you see any race errors reported |
| 1096 | where <computeroutput>libpthread.so</computeroutput> or |
| 1097 | <computeroutput>ld.so</computeroutput> is the object associated |
| 1098 | with the innermost stack frame, please file a bug report at |
| 1099 | http://www.valgrind.org.</para> |
| 1100 | </listitem> |
| 1101 | |
| 1102 | </orderedlist> |
| 1103 | |
| 1104 | </sect1> |
| 1105 | |
| 1106 | |
| 1107 | |
| 1108 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1109 | <sect1 id="hg-manual.options" xreflabel="Helgrind Options"> |
| 1110 | <title>Helgrind Options</title> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1111 | |
| 1112 | <para>The following end-user options are available:</para> |
| 1113 | |
| 1114 | <!-- start of xi:include in the manpage --> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1115 | <variablelist id="hg.opts.list"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1116 | |
| 1117 | <varlistentry id="opt.happens-before" xreflabel="--happens-before"> |
| 1118 | <term> |
| 1119 | <option><![CDATA[--happens-before=none|threads|all |
| 1120 | [default: all] ]]></option> |
| 1121 | </term> |
| 1122 | <listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1123 | <para>Helgrind always regards locks as the basis for |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1124 | inter-thread synchronisation. However, by default, before |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1125 | reporting a race error, Helgrind will also check whether |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1126 | certain other kinds of inter-thread synchronisation events |
| 1127 | happened. It may be that if such events took place, then no |
| 1128 | race really occurred, and so no error needs to be reported. |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1129 | See <link linkend="hg-manual.data-races.exclusive">above</link> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1130 | for a discussion of transfers of exclusive ownership states |
| 1131 | between threads. |
| 1132 | </para> |
| 1133 | <para>With <varname>--happens-before=all</varname>, the |
| 1134 | following events are regarded as sources of synchronisation: |
| 1135 | thread creation/joinage, condition variable |
| 1136 | signal/broadcast/waits, and semaphore posts/waits. |
| 1137 | </para> |
| 1138 | <para>With <varname>--happens-before=threads</varname>, only |
| 1139 | thread creation/joinage events are regarded as sources of |
| 1140 | synchronisation. |
| 1141 | </para> |
| 1142 | <para>With <varname>--happens-before=none</varname>, no events |
| 1143 | (apart, of course, from locking) are regarded as sources of |
| 1144 | synchronisation. |
| 1145 | </para> |
| 1146 | <para>Changing this setting from the default will increase your |
| 1147 | false-error rate but give little or no gain. The only advantage |
| 1148 | is that <option>--happens-before=threads</option> and |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1149 | <option>--happens-before=none</option> should make Helgrind |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1150 | less and less sensitive to the scheduling of threads, and hence |
| 1151 | the output more and more repeatable across runs. |
| 1152 | </para> |
| 1153 | </listitem> |
| 1154 | </varlistentry> |
| 1155 | |
| 1156 | <varlistentry id="opt.trace-addr" xreflabel="--trace-addr"> |
| 1157 | <term> |
| 1158 | <option><![CDATA[--trace-addr=0xXXYYZZ |
| 1159 | ]]></option> and |
| 1160 | <option><![CDATA[--trace-level=0|1|2 [default: 1] |
| 1161 | ]]></option> |
| 1162 | </term> |
| 1163 | <listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1164 | <para>Requests that Helgrind produce a log of all state changes |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1165 | to location 0xXXYYZZ. This can be helpful in tracking down |
| 1166 | tricky races. <varname>--trace-level</varname> controls the |
| 1167 | verbosity of the log. At the default setting (1), a one-line |
| 1168 | summary is printed for each state change. At level 2 a |
| 1169 | complete stack trace is printed for each state change.</para> |
| 1170 | </listitem> |
| 1171 | </varlistentry> |
| 1172 | |
| 1173 | </variablelist> |
| 1174 | <!-- end of xi:include in the manpage --> |
| 1175 | |
| 1176 | <!-- start of xi:include in the manpage --> |
| 1177 | <para>In addition, the following debugging options are available for |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1178 | Helgrind:</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1179 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1180 | <variablelist id="hg.debugopts.list"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1181 | |
| 1182 | <varlistentry id="opt.trace-malloc" xreflabel="--trace-malloc"> |
| 1183 | <term> |
| 1184 | <option><![CDATA[--trace-malloc=no|yes [no] |
| 1185 | ]]></option> |
| 1186 | </term> |
| 1187 | <listitem> |
| 1188 | <para>Show all client malloc (etc) and free (etc) requests.</para> |
| 1189 | </listitem> |
| 1190 | </varlistentry> |
| 1191 | |
| 1192 | <varlistentry id="opt.gen-vcg" xreflabel="--gen-vcg"> |
| 1193 | <term> |
| 1194 | <option><![CDATA[--gen-vcg=no|yes|yes-w-vts [no] |
| 1195 | ]]></option> |
| 1196 | </term> |
| 1197 | <listitem> |
| 1198 | <para>At exit, write to stderr a dump of the happens-before |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1199 | graph computed by Helgrind, in a format suitable for the VCG |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1200 | graph visualisation tool. A suitable command line is:</para> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1201 | <para><computeroutput>valgrind --tool=helgrind |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1202 | --gen-vcg=yes my_app 2>&amp;1 |
| 1203 | | grep xxxxxx | sed "s/xxxxxx//g" |
| 1204 | | xvcg -</computeroutput></para> |
| 1205 | <para>With <varname>--gen-vcg=yes</varname>, the basic |
| 1206 | happens-before graph is shown. With |
| 1207 | <varname>--gen-vcg=yes-w-vts</varname>, the vector timestamp |
| 1208 | for each node is also shown.</para> |
| 1209 | </listitem> |
| 1210 | </varlistentry> |
| 1211 | |
| 1212 | <varlistentry id="opt.cmp-race-err-addrs" |
| 1213 | xreflabel="--cmp-race-err-addrs"> |
| 1214 | <term> |
| 1215 | <option><![CDATA[--cmp-race-err-addrs=no|yes [no] |
| 1216 | ]]></option> |
| 1217 | </term> |
| 1218 | <listitem> |
| 1219 | <para>Controls whether or not race (data) addresses should be |
| 1220 | taken into account when removing duplicates of race errors. |
| 1221 | With <varname>--cmp-race-err-addrs=no</varname>, two otherwise |
| 1222 | identical race errors will be considered to be the same even if |
| 1223 | their race addresses differ. With |
| 1224 | <varname>--cmp-race-err-addrs=yes</varname> they will be |
| 1225 | considered different. This is provided to help make certain |
| 1226 | regression tests work reliably.</para> |
| 1227 | </listitem> |
| 1228 | </varlistentry> |
| 1229 | |
| 1230 | <varlistentry id="opt.tc-sanity-flags" xreflabel="--tc-sanity-flags"> |
| 1231 | <term> |
| 1232 | <option><![CDATA[--tc-sanity-flags=<XXXXX> (X = 0|1) [00000] |
| 1233 | ]]></option> |
| 1234 | </term> |
| 1235 | <listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1236 | <para>Run extensive sanity checks on Helgrind's internal |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1237 | data structures at events defined by the bitstring, as |
| 1238 | follows:</para> |
| 1239 | <para><computeroutput>10000 </computeroutput>after changes to |
| 1240 | the lock order acquisition graph</para> |
| 1241 | <para><computeroutput>01000 </computeroutput>after every client |
| 1242 | memory access (NB: not currently used)</para> |
| 1243 | <para><computeroutput>00100 </computeroutput>after every client |
| 1244 | memory range permission setting of 256 bytes or greater</para> |
| 1245 | <para><computeroutput>00010 </computeroutput>after every client |
| 1246 | lock or unlock event</para> |
| 1247 | <para><computeroutput>00001 </computeroutput>after every client |
| 1248 | thread creation or joinage event</para> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1249 | <para>Note these will make Helgrind run very slowly, often to |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1250 | the point of being completely unusable.</para> |
| 1251 | </listitem> |
| 1252 | </varlistentry> |
| 1253 | |
| 1254 | </variablelist> |
| 1255 | <!-- end of xi:include in the manpage --> |
| 1256 | |
| 1257 | |
| 1258 | </sect1> |
| 1259 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame^] | 1260 | <sect1 id="hg-manual.todolist" xreflabel="To Do List"> |
| 1261 | <title>A To-Do List for Helgrind</title> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1262 | |
| 1263 | <para>The following is a list of loose ends which should be tidied up |
| 1264 | some time.</para> |
| 1265 | |
| 1266 | <itemizedlist> |
| 1267 | <listitem><para>Track which mutexes are associated with which |
| 1268 | condition variables, and emit a warning if this becomes |
| 1269 | inconsistent.</para> |
| 1270 | </listitem> |
| 1271 | <listitem><para>For lock order errors, print the complete lock |
| 1272 | cycle, rather than only doing so for size-2 cycles as at |
| 1273 | present.</para> |
| 1274 | </listitem> |
| 1275 | <listitem><para>Document the VALGRIND_HG_CLEAN_MEMORY client |
| 1276 | request.</para> |
| 1277 | </listitem> |
| 1278 | <listitem><para>Possibly a client request to forcibly transfer |
| 1279 | ownership of memory from one thread to another. Requires further |
| 1280 | consideration.</para> |
| 1281 | </listitem> |
| 1282 | <listitem><para>Add a new client request that marks an address range |
| 1283 | as being "shared-modified with empty lockset" (the error state), |
| 1284 | and describe how to use it.</para> |
| 1285 | </listitem> |
| 1286 | <listitem><para>Document races caused by gcc's thread-unsafe code |
| 1287 | generation for speculative stores. In the interim see |
| 1288 | <computeroutput>http://gcc.gnu.org/ml/gcc/2007-10/msg00266.html |
| 1289 | </computeroutput> |
| 1290 | and <computeroutput>http://lkml.org/lkml/2007/10/24/673</computeroutput>. |
| 1291 | </para> |
| 1292 | </listitem> |
| 1293 | <listitem><para>Don't update the lock-order graph, and don't check |
| 1294 | for errors, when a "try"-style lock operation happens (e.g. |
| 1295 | pthread_mutex_trylock). Such calls do not add any real |
| 1296 | restrictions to the locking order, since they can always fail to |
| 1297 | acquire the lock, resulting in the caller going off and doing Plan |
| 1298 | B (presumably it will have a Plan B). Doing such checks could |
| 1299 | generate false lock-order errors and confuse users.</para> |
| 1300 | </listitem> |
| 1301 | <listitem><para> Performance can be very poor. Slowdowns on the |
| 1302 | order of 100:1 are not unusual. There is quite some scope for |
| 1303 | performance improvements, though. |
| 1304 | </para> |
| 1305 | </listitem> |
| 1306 | |
| 1307 | </itemizedlist> |
| 1308 | |
| 1309 | </sect1> |
| 1310 | |
| 1311 | </chapter> |