sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1 | <?xml version="1.0"?> <!-- -*- sgml -*- --> |
| 2 | <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN" |
sewardj | 3387889 | 2007-11-17 09:43:25 +0000 | [diff] [blame^] | 3 | "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd" |
| 4 | [ <!ENTITY % vg-entities SYSTEM "../../docs/xml/vg-entities.xml"> %vg-entities; ]> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 5 | |
| 6 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 7 | <chapter id="hg-manual" xreflabel="Helgrind: thread error detector"> |
| 8 | <title>Helgrind: a thread error detector</title> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 9 | |
| 10 | <para>To use this tool, you must specify |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 11 | <computeroutput>--tool=helgrind</computeroutput> on the Valgrind |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 12 | command line.</para> |
| 13 | |
| 14 | |
| 15 | |
| 16 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 17 | <sect1 id="hg-manual.overview" xreflabel="Overview"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 18 | <title>Overview</title> |
| 19 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 20 | <para>Helgrind is a Valgrind tool for detecting synchronisation errors |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 21 | in C, C++ and Fortran programs that use the POSIX pthreads |
| 22 | threading primitives.</para> |
| 23 | |
| 24 | <para>The main abstractions in POSIX pthreads are: a set of threads |
| 25 | sharing a common address space, thread creation, thread joining, |
| 26 | thread exit, mutexes (locks), condition variables (inter-thread event |
| 27 | notifications), reader-writer locks, and semaphores.</para> |
| 28 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 29 | <para>Helgrind is aware of all these abstractions and tracks their |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 30 | effects as accurately as it can. Currently it does not correctly |
| 31 | handle pthread barriers and pthread spinlocks, although it will not |
| 32 | object if you use them. On x86 and amd64 platforms, it understands |
| 33 | and partially handles implicit locking arising from the use of the |
| 34 | LOCK instruction prefix. |
| 35 | </para> |
| 36 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 37 | <para>Helgrind can detect three classes of errors, which are discussed |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 38 | in detail in the next three sections:</para> |
| 39 | |
| 40 | <orderedlist> |
| 41 | <listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 42 | <para><link linkend="hg-manual.api-checks"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 43 | Misuses of the POSIX pthreads API.</link></para> |
| 44 | </listitem> |
| 45 | <listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 46 | <para><link linkend="hg-manual.lock-orders"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 47 | Potential deadlocks arising from lock |
| 48 | ordering problems.</link></para> |
| 49 | </listitem> |
| 50 | <listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 51 | <para><link linkend="hg-manual.data-races"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 52 | Data races -- accessing memory without adequate locking. |
| 53 | </link></para> |
| 54 | </listitem> |
| 55 | </orderedlist> |
| 56 | |
| 57 | <para>Following those is a section containing |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 58 | <link linkend="hg-manual.effective-use"> |
| 59 | hints and tips on how to get the best out of Helgrind.</link> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 60 | </para> |
| 61 | |
| 62 | <para>Then there is a |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 63 | <link linkend="hg-manual.options">summary of command-line |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 64 | options.</link> |
| 65 | </para> |
| 66 | |
| 67 | <para>Finally, there is |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 68 | <link linkend="hg-manual.todolist">a brief summary of areas in which Helgrind |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 69 | could be improved.</link> |
| 70 | </para> |
| 71 | |
| 72 | </sect1> |
| 73 | |
| 74 | |
| 75 | |
| 76 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 77 | <sect1 id="hg-manual.api-checks" xreflabel="API Checks"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 78 | <title>Detected errors: Misuses of the POSIX pthreads API</title> |
| 79 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 80 | <para>Helgrind intercepts calls to many POSIX pthreads functions, and |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 81 | is therefore able to report on various common problems. Although |
| 82 | these are unglamorous errors, their presence can lead to undefined |
| 83 | program behaviour and hard-to-find bugs later in execution. The |
| 84 | detected errors are:</para> |
| 85 | |
| 86 | <itemizedlist> |
| 87 | <listitem><para>unlocking an invalid mutex</para></listitem> |
| 88 | <listitem><para>unlocking a not-locked mutex</para></listitem> |
| 89 | <listitem><para>unlocking a mutex held by a different |
| 90 | thread</para></listitem> |
| 91 | <listitem><para>destroying an invalid or a locked mutex</para></listitem> |
| 92 | <listitem><para>recursively locking a non-recursive mutex</para></listitem> |
| 93 | <listitem><para>deallocation of memory that contains a |
| 94 | locked mutex</para></listitem> |
| 95 | <listitem><para>passing mutex arguments to functions expecting |
| 96 | reader-writer lock arguments, and vice |
| 97 | versa</para></listitem> |
| 98 | <listitem><para>a POSIX pthread function failing with an |
| 99 | error code that must be handled</para></listitem> |
| 100 | <listitem><para>a thread exiting whilst still holding locked |
| 101 | locks</para></listitem> |
| 102 | <listitem><para>calling <computeroutput>pthread_cond_wait</computeroutput> |
| 103 | with a not-locked mutex, or one locked by a different |
| 104 | thread</para></listitem> |
| 105 | </itemizedlist> |
| 106 | |
| 107 | <para>Checks pertaining to the validity of mutexes are generally also |
| 108 | performed for reader-writer locks.</para> |
| 109 | |
| 110 | <para>Various kinds of this-can't-possibly-happen events are also |
| 111 | reported. These usually indicate bugs in the system threading |
| 112 | library.</para> |
| 113 | |
| 114 | <para>Reported errors always contain a primary stack trace indicating |
| 115 | where the error was detected. They may also contain auxiliary stack |
| 116 | traces giving additional information. In particular, most errors |
| 117 | relating to mutexes will also tell you where that mutex first came to |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 118 | Helgrind's attention (the "<computeroutput>was first observed |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 119 | at</computeroutput>" part), so you have a chance of figuring out which |
| 120 | mutex it is referring to. For example:</para> |
| 121 | |
| 122 | <programlisting><![CDATA[ |
| 123 | Thread #1 unlocked a not-locked lock at 0x7FEFFFA90 |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 124 | at 0x4C2408D: pthread_mutex_unlock (hg_intercepts.c:492) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 125 | by 0x40073A: nearly_main (tc09_bad_unlock.c:27) |
| 126 | by 0x40079B: main (tc09_bad_unlock.c:50) |
| 127 | Lock at 0x7FEFFFA90 was first observed |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 128 | at 0x4C25D01: pthread_mutex_init (hg_intercepts.c:326) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 129 | by 0x40071F: nearly_main (tc09_bad_unlock.c:23) |
| 130 | by 0x40079B: main (tc09_bad_unlock.c:50) |
| 131 | ]]></programlisting> |
| 132 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 133 | <para>Helgrind has a way of summarising thread identities, as |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 134 | evidenced here by the text "<computeroutput>Thread |
| 135 | #1</computeroutput>". This is so that it can speak about threads and |
| 136 | sets of threads without overwhelming you with details. See |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 137 | <link linkend="hg-manual.data-races.errmsgs">below</link> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 138 | for more information on interpreting error messages.</para> |
| 139 | |
| 140 | </sect1> |
| 141 | |
| 142 | |
| 143 | |
| 144 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 145 | <sect1 id="hg-manual.lock-orders" xreflabel="Lock Orders"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 146 | <title>Detected errors: Inconsistent Lock Orderings</title> |
| 147 | |
| 148 | <para>In this section, and in general, to "acquire" a lock simply |
| 149 | means to lock that lock, and to "release" a lock means to unlock |
| 150 | it.</para> |
| 151 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 152 | <para>Helgrind monitors the order in which threads acquire locks. |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 153 | This allows it to detect potential deadlocks which could arise from |
| 154 | the formation of cycles of locks. Detecting such inconsistencies is |
| 155 | useful because, whilst actual deadlocks are fairly obvious, potential |
| 156 | deadlocks may never be discovered during testing and could later lead |
| 157 | to hard-to-diagnose in-service failures.</para> |
| 158 | |
| 159 | <para>The simplest example of such a problem is as |
| 160 | follows.</para> |
| 161 | |
| 162 | <itemizedlist> |
| 163 | <listitem><para>Imagine some shared resource R, which, for whatever |
| 164 | reason, is guarded by two locks, L1 and L2, which must both be held |
| 165 | when R is accessed.</para> |
| 166 | </listitem> |
| 167 | <listitem><para>Suppose a thread acquires L1, then L2, and proceeds |
| 168 | to access R. The implication of this is that all threads in the |
| 169 | program must acquire the two locks in the order first L1 then L2. |
| 170 | Not doing so risks deadlock.</para> |
| 171 | </listitem> |
| 172 | <listitem><para>The deadlock could happen if two threads -- call them |
| 173 | T1 and T2 -- both want to access R. Suppose T1 acquires L1 first, |
| 174 | and T2 acquires L2 first. Then T1 tries to acquire L2, and T2 tries |
| 175 | to acquire L1, but those locks are both already held. So T1 and T2 |
| 176 | become deadlocked.</para> |
| 177 | </listitem> |
| 178 | </itemizedlist> |
| 179 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 180 | <para>Helgrind builds a directed graph indicating the order in which |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 181 | locks have been acquired in the past. When a thread acquires a new |
| 182 | lock, the graph is updated, and then checked to see if it now contains |
| 183 | a cycle. The presence of a cycle indicates a potential deadlock involving |
| 184 | the locks in the cycle.</para> |
| 185 | |
| 186 | <para>In simple situations, where the cycle only contains two locks, |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 187 | Helgrind will show where the required order was established:</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 188 | |
| 189 | <programlisting><![CDATA[ |
| 190 | Thread #1: lock order "0x7FEFFFAB0 before 0x7FEFFFA80" violated |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 191 | at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 192 | by 0x40081F: main (tc13_laog1.c:24) |
| 193 | Required order was established by acquisition of lock at 0x7FEFFFAB0 |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 194 | at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 195 | by 0x400748: main (tc13_laog1.c:17) |
| 196 | followed by a later acquisition of lock at 0x7FEFFFA80 |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 197 | at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 198 | by 0x400773: main (tc13_laog1.c:18) |
| 199 | ]]></programlisting> |
| 200 | |
| 201 | <para>When there are more than two locks in the cycle, the error is |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 202 | equally serious. However, at present Helgrind does not show the locks |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 203 | involved, so as to avoid flooding you with information. That could be |
| 204 | fixed in future. For example, here is a report involving a cycle |
| 205 | of five locks from a naive implementation of the famous Dining |
| 206 | Philosophers problem |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 207 | (see <computeroutput>helgrind/tests/tc14_laog_dinphils.c</computeroutput>). |
| 208 | In this case Helgrind has detected that all 5 philosophers could |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 209 | simultaneously pick up their left fork and then deadlock whilst |
| 210 | waiting to pick up their right forks.</para> |
| 211 | |
| 212 | <programlisting><![CDATA[ |
| 213 | Thread #6: lock order "0x6010C0 before 0x601160" violated |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 214 | at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 215 | by 0x4007C0: dine (tc14_laog_dinphils.c:19) |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 216 | by 0x4C25DF7: mythread_wrapper (hg_intercepts.c:178) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 217 | by 0x4E2F09D: start_thread (in /lib64/libpthread-2.5.so) |
| 218 | by 0x51054CC: clone (in /lib64/libc-2.5.so) |
| 219 | ]]></programlisting> |
| 220 | |
| 221 | </sect1> |
| 222 | |
| 223 | |
| 224 | |
| 225 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 226 | <sect1 id="hg-manual.data-races" xreflabel="Data Races"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 227 | <title>Detected errors: Data Races</title> |
| 228 | |
| 229 | <para>A data race happens, or could happen, when two threads |
| 230 | access a shared memory location without using suitable locks to |
| 231 | ensure single-threaded access. Such missing locking can cause |
| 232 | obscure timing-dependent bugs. Ensuring programs are race-free is |
| 233 | one of the central difficulties of threaded programming.</para> |
| 234 | |
| 235 | <para>Reliably detecting races is a difficult problem, and most |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 236 | of Helgrind's internals are devoted to dealing with it. |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 237 | As a consequence this section is somewhat long and involved. |
| 238 | We begin with a simple example.</para> |
| 239 | |
| 240 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 241 | <sect2 id="hg-manual.data-races.example" xreflabel="Simple Race"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 242 | <title>A Simple Data Race</title> |
| 243 | |
| 244 | <para>About the simplest possible example of a race is as follows. In |
| 245 | this program, it is impossible to know what the value |
| 246 | of <computeroutput>var</computeroutput> is at the end of the program. |
| 247 | Is it 2? Or 1?</para> |
| 248 | |
| 249 | <programlisting><![CDATA[ |
| 250 | #include <pthread.h> |
| 251 | |
| 252 | int var = 0; |
| 253 | |
| 254 | void* child_fn ( void* arg ) { |
| 255 | var++; /* Unprotected relative to parent */ /* this is line 6 */ |
| 256 | return NULL; |
| 257 | } |
| 258 | |
| 259 | int main ( void ) { |
| 260 | pthread_t child; |
| 261 | pthread_create(&child, NULL, child_fn, NULL); |
| 262 | var++; /* Unprotected relative to child */ /* this is line 13 */ |
| 263 | pthread_join(child, NULL); |
| 264 | return 0; |
| 265 | } |
| 266 | ]]></programlisting> |
| 267 | |
| 268 | <para>The problem is there is nothing to |
| 269 | stop <computeroutput>var</computeroutput> being updated simultaneously |
| 270 | by both threads. A correct program would |
| 271 | protect <computeroutput>var</computeroutput> with a lock of type |
| 272 | <computeroutput>pthread_mutex_t</computeroutput>, which is acquired |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 273 | before each access and released afterwards. Helgrind's output for |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 274 | this program is:</para> |
| 275 | |
| 276 | <programlisting><![CDATA[ |
| 277 | Thread #1 is the program's root thread |
| 278 | |
| 279 | Thread #2 was created |
| 280 | at 0x510548E: clone (in /lib64/libc-2.5.so) |
| 281 | by 0x4E2F305: do_clone (in /lib64/libpthread-2.5.so) |
| 282 | by 0x4E2F7C5: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.5.so) |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 283 | by 0x4C23870: pthread_create@* (hg_intercepts.c:198) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 284 | by 0x4005F1: main (simple_race.c:12) |
| 285 | |
| 286 | Possible data race during write of size 4 at 0x601034 |
| 287 | at 0x4005F2: main (simple_race.c:13) |
| 288 | Old state: shared-readonly by threads #1, #2 |
| 289 | New state: shared-modified by threads #1, #2 |
| 290 | Reason: this thread, #1, holds no consistent locks |
| 291 | Location 0x601034 has never been protected by any lock |
| 292 | ]]></programlisting> |
| 293 | |
| 294 | <para>This is quite a lot of detail for an apparently simple error. |
| 295 | The last clause is the main error message. It says there is a race as |
| 296 | a result of a write of size 4 (bytes), at 0x601034, which is |
| 297 | presumably the address of <computeroutput>var</computeroutput>, |
| 298 | happening in function <computeroutput>main</computeroutput> at line 13 |
| 299 | in the program.</para> |
| 300 | |
| 301 | <para>Note that it is purely by chance that the race is |
| 302 | reported for the parent thread's access. It could equally have been |
| 303 | reported instead for the child's access, at line 6. The error will |
| 304 | only be reported for one of the locations, since neither the parent |
| 305 | nor child is, by itself, incorrect. It is only when both access |
| 306 | <computeroutput>var</computeroutput> without a lock that an error |
| 307 | exists.</para> |
| 308 | |
| 309 | <para>The error message shows some other interesting details. The |
| 310 | sections below explain them. Here we merely note their presence:</para> |
| 311 | |
| 312 | <itemizedlist> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 313 | <listitem><para>Helgrind maintains some kind of state machine for the |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 314 | memory location in question, hence the "<computeroutput>Old |
| 315 | state:</computeroutput>" and "<computeroutput>New |
| 316 | state:</computeroutput>" lines.</para> |
| 317 | </listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 318 | <listitem><para>Helgrind keeps track of which threads have accessed |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 319 | the location: "<computeroutput>threads #1, #2</computeroutput>". |
| 320 | Before printing the main error message, it prints the creation |
| 321 | points of these two threads, so you can see which threads it is |
| 322 | referring to.</para> |
| 323 | </listitem> |
sewardj | 3387889 | 2007-11-17 09:43:25 +0000 | [diff] [blame^] | 324 | <listitem><para>Helgrind tries to provide an explanation of why the |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 325 | race exists: "<computeroutput>Location 0x601034 has never been |
| 326 | protected by any lock</computeroutput>".</para> |
| 327 | </listitem> |
| 328 | </itemizedlist> |
| 329 | |
| 330 | <para>Understanding the memory state machine is central to |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 331 | understanding Helgrind's race-detection algorithm. The next three |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 332 | subsections explain this.</para> |
| 333 | |
| 334 | </sect2> |
| 335 | |
| 336 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 337 | <sect2 id="hg-manual.data-races.memstates" xreflabel="Memory States"> |
| 338 | <title>Helgrind's Memory State Machine</title> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 339 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 340 | <para>Helgrind tracks the state of every byte of memory used by your |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 341 | program. There are a number of states, but only three are |
| 342 | interesting:</para> |
| 343 | |
| 344 | <itemizedlist> |
| 345 | <listitem><para>Exclusive: memory in this state is regarded as owned |
| 346 | exclusively by one particular thread. That thread may read and |
| 347 | write it without a lock. Even in highly threaded programs, the |
| 348 | majority of locations never leave the Exclusive state, since most |
| 349 | data is thread-private.</para> |
| 350 | </listitem> |
| 351 | <listitem><para>Shared-Readonly: memory in this state is regarded as |
| 352 | shared by multiple threads. In this state, any thread may read the |
| 353 | memory without a lock, reflecting the fact that readonly data may |
| 354 | safely be shared between threads without locking.</para> |
| 355 | </listitem> |
| 356 | <listitem><para>Shared-Modified: memory in this state is regarded as |
| 357 | shared by multiple threads, at least one of which has written to it. |
| 358 | All participating threads must hold at least one lock in common when |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 359 | accessing the memory. If no such lock exists, Helgrind reports a |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 360 | race error.</para> |
| 361 | </listitem> |
| 362 | </itemizedlist> |
| 363 | |
| 364 | <para>Let's review the simple example above with this in mind. When |
| 365 | the program starts, <computeroutput>var</computeroutput> is not in any |
| 366 | of these states. Either the parent or child thread gets to its |
| 367 | <computeroutput>var++</computeroutput> first, and thereby |
| 368 | gets Exclusive ownership of the location.</para> |
| 369 | |
| 370 | <para>The later-running thread now arrives at |
| 371 | its <computeroutput>var++</computeroutput> statement. It first reads |
| 372 | the existing value from memory. |
| 373 | Because <computeroutput>var</computeroutput> is currently marked as |
| 374 | owned exclusively by the other thread, its state is changed to |
| 375 | shared-readonly by both threads.</para> |
| 376 | |
| 377 | <para>This same thread adds one to the value it has and stores it back |
| 378 | in <computeroutput>var</computeroutput>. This causes another state |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 379 | change, this time to the shared-modified state. Because Helgrind has |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 380 | also been tracking which threads hold which locks, it can see that |
| 381 | <computeroutput>var</computeroutput> is in shared-modified state but |
| 382 | no lock has been used to consistently protect it. Hence a race is |
| 383 | reported exactly at the transition from shared-readonly to |
| 384 | shared-modified.</para> |
| 385 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 386 | <para>The essence of the algorithm is this. Helgrind keeps track of |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 387 | each memory location that has been accessed by more than one thread. |
| 388 | For each such location it incrementally infers the set of locks which |
| 389 | have consistently been used to protect that location. If the |
| 390 | location's lockset becomes empty, and at some point one of the threads |
| 391 | attempts to write to it, a race is then reported.</para> |
| 392 | |
| 393 | <para>This technique is known as "lockset inference" and was |
| 394 | introduced in: "Eraser: A Dynamic Data Race Detector for Multithreaded |
| 395 | Programs" (Stefan Savage, Michael Burrows, Greg Nelson, Patrick |
| 396 | Sobalvarro and Thomas Anderson, ACM Transactions on Computer Systems, |
| 397 | 15(4):391-411, November 1997).</para> |
| 398 | |
| 399 | <para>Lockset inference has since been widely implemented, studied and |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 400 | extended. Helgrind incorporates several refinements aimed at avoiding |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 401 | the high false error rate that naive versions of the algorithm suffer |
| 402 | from. A |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 403 | <link linkend="hg-manual.data-races.summary">summary of the complete |
| 404 | algorithm used by Helgrind</link> is presented below. First, however, |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 405 | it is important to understand details of transitions pertaining to the |
| 406 | Exclusive-ownership state.</para> |
| 407 | |
| 408 | </sect2> |
| 409 | |
| 410 | |
| 411 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 412 | <sect2 id="hg-manual.data-races.exclusive" xreflabel="Excl Transfers"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 413 | <title>Transfers of Exclusive Ownership Between Threads</title> |
| 414 | |
| 415 | <para>As presented, the algorithm is far too strict. It reports many |
| 416 | errors in perfectly correct, widely used parallel programming |
| 417 | constructions, for example, using child worker threads and worker |
| 418 | thread pools.</para> |
| 419 | |
| 420 | <para>To avoid these false errors, we must refine the algorithm so |
| 421 | that it keeps memory in an Exclusive ownership state in cases where it |
| 422 | would otherwise decay into a shared-readonly or shared-modified state. |
| 423 | Recall that Exclusive ownership is special in that it grants the |
| 424 | owning thread the right to access memory without use of any locks. In |
| 425 | order to support worker-thread and worker-thread-pool idioms, we will |
| 426 | allow threads to steal exclusive ownership of memory from other |
| 427 | threads under certain circumstances.</para> |
| 428 | |
| 429 | <para>Here's an example. Imagine a parent thread creates child |
| 430 | threads to do units of work. For each unit of work, the parent |
| 431 | allocates a work buffer, fills it in, and creates the child thread, |
| 432 | handing it a pointer to the buffer. The child reads/writes the buffer |
| 433 | and eventually exits, and the waiting parent then extracts the results |
| 434 | from the buffer:</para> |
| 435 | |
| 436 | <programlisting><![CDATA[ |
| 437 | typedef ... Buffer; |
| 438 | |
| 439 | pthread_t child; |
| 440 | Buffer buf; |
| 441 | |
| 442 | /* ---- Parent ---- */                  /* ---- Child ---- */ |
| 443 | |
| 444 | /* parent writes workload into buf */ |
| 445 | pthread_create( &child, NULL, child_fn, &buf ); |
| 446 | |
| 447 | /* parent does not read */              void child_fn ( Buffer* buf ) { |
| 448 | /* or write buf */                         /* read/write buf */ |
| 449 |                                         } |
| 450 | |
| 451 | pthread_join ( child, NULL ); |
| 452 | /* parent reads results from buf */ |
| 453 | ]]></programlisting> |
| 454 | |
| 455 | <para>Although <computeroutput>buf</computeroutput> is accessed by |
| 456 | both threads, neither uses locks, yet the program is race-free. The |
| 457 | essential observation is that the child's creation and exit create |
| 458 | synchronisation events between it and the parent. These force the |
| 459 | child's accesses to <computeroutput>buf</computeroutput> to happen |
| 460 | after the parent initialises <computeroutput>buf</computeroutput>, and |
| 461 | before the parent reads the results |
| 462 | from <computeroutput>buf</computeroutput>.</para> |
| 463 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 464 | <para>To model this, Helgrind allows the child to steal, from the |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 465 | parent, exclusive ownership of any memory exclusively owned by the |
| 466 | parent before the pthread_create call. Similarly, once the parent's |
| 467 | pthread_join call returns, it can steal back ownership of memory |
| 468 | exclusively owned by the child. In this way ownership |
| 469 | of <computeroutput>buf</computeroutput> is transferred from parent to |
| 470 | child and back, so the basic algorithm does not report any races |
| 471 | despite the absence of any locking.</para> |
| 472 | |
| 473 | <para>Note that the child may only steal memory owned by the parent |
| 474 | prior to the pthread_create call. If the child attempts to read or |
| 475 | write memory which is also accessed by the parent in between the |
| 476 | pthread_create and pthread_join calls, an error is still |
| 477 | reported.</para> |
| 478 | |
| 479 | <para>This technique was introduced with the name "thread lifetime |
| 480 | segments" in "Runtime Checking of Multithreaded Applications with |
| 481 | Visual Threads" (Jerry J. Harrow, Jr, Proceedings of the 7th |
| 482 | International SPIN Workshop on Model Checking of Software, Stanford, |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 483 | California, USA, August 2000, LNCS 1885, pp331--342). Helgrind |
| 484 | implements an extended version of it. Specifically, Helgrind allows |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 485 | transfer of exclusive ownership in the following situations:</para> |
| 486 | |
| 487 | <itemizedlist> |
| 488 | <listitem><para>At thread creation: a child can acquire ownership of |
| 489 | memory held exclusively by the parent prior to the child's |
| 490 | creation.</para> |
| 491 | </listitem> |
| 492 | <listitem><para>At thread joining: the joiner (thread not exiting) |
| 493 | can acquire ownership of memory held exclusively by the joinee |
| 494 | (thread that is exiting) at the point it exited.</para> |
| 495 | </listitem> |
| 496 | <listitem><para>At condition variable signallings and broadcasts. A |
| 497 | thread Tw which completes a pthread_cond_wait call as a result of |
| 498 | a signal or broadcast on the same condition variable by some other |
| 499 | thread Ts, may acquire ownership of memory held exclusively by |
| 500 | Ts prior to the pthread_cond_signal/broadcast |
| 501 | call.</para> |
| 502 | </listitem> |
| 503 | <listitem><para>At semaphore post (sem_post) calls. A thread Tw |
| 504 | which completes a sem_wait call as a result of a sem_post call |
| 505 | on the same semaphore by some other thread Tp, may acquire |
| 506 | ownership of memory held exclusively by Tp prior to the sem_post |
| 507 | call.</para> |
| 508 | </listitem> |
| 509 | </itemizedlist> |
| 510 | |
| 511 | </sect2> |
| 512 | |
| 513 | |
| 514 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 515 | <sect2 id="hg-manual.data-races.re-excl" xreflabel="Re-Excl Transfers"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 516 | <title>Restoration of Exclusive Ownership</title> |
| 517 | |
| 518 | <para>Another common idiom is to partition the lifetime of the program |
| 519 | as a whole into several distinct phases. In some of those phases, a |
| 520 | memory location may be accessed by multiple threads and so require |
| 521 | locking. In other phases only one thread exists and so can access the |
| 522 | memory without locking. For example:</para> |
| 523 | |
| 524 | <programlisting><![CDATA[ |
| 525 | int var = 0; /* shared variable */ |
| 526 | pthread_mutex_t mx = PTHREAD_MUTEX_INITIALIZER; /* guard for var */ |
| 527 | pthread_t child; |
| 528 | |
| 529 | /* ---- Parent ---- */ /* ---- Child ---- */ |
| 530 | |
| 531 | var += 1; /* no lock used */ |
| 532 | |
| 533 | pthread_create( &child, NULL, child_fn, NULL ); |
| 534 | |
| 535 | void* child_fn ( void* uu ) { |
| 536 | pthread_mutex_lock(&mx); pthread_mutex_lock(&mx); |
| 537 | var += 2; var += 3; |
| 538 | pthread_mutex_unlock(&mx); pthread_mutex_unlock(&mx); |
| 539 | return NULL; } |
| 540 | |
| 541 | pthread_join ( child, NULL ); |
| 542 | |
| 543 | var += 4; /* no lock used */ |
| 544 | ]]></programlisting> |
| 545 | |
| 546 | <para>This program is correct, but using only the mechanisms described |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 547 | so far, Helgrind would report an error at |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 548 | <computeroutput>var += 4</computeroutput>. This is because, by that |
| 549 | point, <computeroutput>var</computeroutput> is marked as being in the |
| 550 | state "shared-modified and protected by the |
| 551 | lock <computeroutput>mx</computeroutput>", but is being accessed |
| 552 | without locking. Really, what we want is |
| 553 | for <computeroutput>var</computeroutput> to return to the parent |
| 554 | thread's exclusive ownership after the child thread has exited.</para> |
| 555 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 556 | <para>To make this possible, for every memory location Helgrind also keeps |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 557 | track of all the threads that have accessed that location |
| 558 | -- its threadset. When a thread Tquitter joins back to Tstayer, |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 559 | Helgrind examines the threadsets of all memory in shared-modified or |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 560 | shared-readable state. In each such threadset, if Tquitter is |
| 561 | mentioned, it is removed and replaced by Tstayer. If, as a result, a |
| 562 | threadset becomes a singleton set containing Tstayer, then the |
| 563 | location's state is changed to belongs-exclusively-to-Tstayer.</para> |
| 564 | |
| 565 | <para>In our example, the result is exactly as we desire: |
| 566 | <computeroutput>var</computeroutput> is reacquired exclusively by the |
| 567 | parent after the child exits.</para> |
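
<para>The threadset update performed at a join can be sketched as
follows. This is an illustrative model only, not Helgrind's actual
data structures: the types <computeroutput>ThreadSet</computeroutput>
and <computeroutput>LocInfo</computeroutput>, the function
<computeroutput>on_join</computeroutput>, and the choice of a 32-bit
mask (so thread numbers must be below 32) are all assumptions made for
the example.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>

typedef uint32_t ThreadSet;   /* bit i set: thread #i has accessed */

typedef enum { EXCLUSIVE, SHARED } OwnerState;

typedef struct {
   OwnerState state;
   ThreadSet  threads;        /* for EXCLUSIVE, exactly one bit is set */
} LocInfo;

/* Apply "quitter joins back to stayer" to one location's record. */
void on_join ( LocInfo* loc, unsigned quitter, unsigned stayer ) {
   ThreadSet q = (ThreadSet)1 << quitter;
   ThreadSet s = (ThreadSet)1 << stayer;
   assert(quitter < 32 && stayer < 32);
   if (loc->threads & q)
      loc->threads = (loc->threads & ~q) | s;  /* replace quitter */
   if (loc->threads == s)
      loc->state = EXCLUSIVE;                  /* singleton { stayer } */
}
]]></programlisting>

<para>For the example above, a record holding threads #1 and #2 in the
shared state becomes exclusively owned by #1 once #2 joins back to it,
whereas a record that also mentions a third thread remains
shared.</para>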
| 568 | |
| 569 | <para>More generally, when a group of threads merges back to a single |
| 570 | thread via a cascade of pthread_join calls, any memory shared by the |
| 571 | group (or a subset of it) ends up being owned exclusively by the sole |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 572 | surviving thread. This significantly enhances Helgrind's flexibility, |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 573 | since it means that each memory location may make arbitrarily many |
| 574 | transitions between exclusive and shared ownership. Furthermore, a |
| 575 | different lock may protect the location during each period of shared |
| 576 | ownership.</para> |
| 577 | |
| 578 | </sect2> |
| 579 | |
| 580 | |
| 581 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 582 | <sect2 id="hg-manual.data-races.summary" xreflabel="Race Det Summary"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 583 | <title>A Summary of the Race Detection Algorithm</title> |
| 584 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 585 | <para>Helgrind looks for memory locations which are accessed by more |
| 586 | than one thread. For each such location, Helgrind records which of |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 587 | the program's locks were held by the accessing thread at the time of |
| 588 | each access. The hope is to discover that there is indeed at least |
| 589 | one lock which is consistently used by all threads to protect that |
| 590 | location. If no such lock can be found, then there is apparently no |
| 591 | consistent locking strategy being applied for that location, and so a |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 592 | possible data race might result. Helgrind accordingly reports an |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 593 | error.</para> |
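
<para>The basic idea just described amounts to intersecting, for each
location, the sets of locks held at each access. A minimal sketch,
assuming locks are numbered 0 to 31 and represented as bits in a mask;
the names <computeroutput>LockSet</computeroutput>,
<computeroutput>Shadow</computeroutput>,
<computeroutput>shadow_init</computeroutput> and
<computeroutput>on_access</computeroutput> are invented here and do not
correspond to Helgrind's internals. For brevity the sketch also ignores
which thread made each access.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t LockSet;     /* bit i set: lock #i is in the set */

typedef struct {
   LockSet candidates;        /* locks held at every access so far */
} Shadow;

/* Before any access, every lock is still a candidate. */
void shadow_init ( Shadow* sh ) {
   sh->candidates = ~(LockSet)0;
}

/* Record one access made while holding 'held'.  Returns false if no
   lock has been held in common across all accesses, which is the
   point at which a possible race would be reported. */
bool on_access ( Shadow* sh, LockSet held ) {
   sh->candidates &= held;
   return sh->candidates != 0;
}
]]></programlisting>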
| 594 | |
| 595 | <para>In practice this discipline is far too simplistic, and is |
| 596 | unusable since it reports many races in some widely used and |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 597 | known-correct programming disciplines. Helgrind's checking therefore |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 598 | incorporates many refinements to this basic idea, and can be |
| 599 | summarised as follows:</para> |
| 600 | |
| 601 | <para>The following thread events are intercepted and monitored:</para> |
| 602 | |
| 603 | <itemizedlist> |
| 604 | <listitem><para>thread creation and exiting (pthread_create, |
| 605 | pthread_join, pthread_exit)</para> |
| 606 | </listitem> |
| 607 | <listitem> |
| 608 | <para>lock acquisition and release (pthread_mutex_lock, |
| 609 | pthread_mutex_unlock, pthread_rwlock_rdlock, |
| 610 | pthread_rwlock_wrlock, |
| 611 | pthread_rwlock_unlock)</para> |
| 612 | </listitem> |
| 613 | <listitem> |
| 614 | <para>inter-thread event notifications (pthread_cond_wait, |
| 615 | pthread_cond_signal, pthread_cond_broadcast, |
| 616 | sem_wait, sem_post)</para> |
| 617 | </listitem> |
| 618 | </itemizedlist> |
| 619 | |
| 620 | <para>Memory allocation and deallocation events are intercepted and |
| 621 | monitored:</para> |
| 622 | |
| 623 | <itemizedlist> |
| 624 | <listitem> |
| 625 | <para>malloc/new/free/delete and variants</para> |
| 626 | </listitem> |
| 627 | <listitem> |
| 628 | <para>stack allocation and deallocation</para> |
| 629 | </listitem> |
| 630 | </itemizedlist> |
| 631 | |
| 632 | <para>All memory accesses are intercepted and monitored.</para> |
| 633 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 634 | <para>By observing the above events, Helgrind can infer certain |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 635 | aspects of the program's locking discipline. Programs which adhere to |
| 636 | the following rules are considered to be acceptable: |
| 637 | </para> |
| 638 | |
| 639 | <itemizedlist> |
| 640 | <listitem> |
| 641 | <para>A thread may allocate memory, and write initial values into |
| 642 | it, without locking. That thread is regarded as owning the memory |
| 643 | exclusively.</para> |
| 644 | </listitem> |
| 645 | <listitem> |
| 646 | <para>A thread may read and write memory which it owns exclusively, |
| 647 | without locking.</para> |
| 648 | </listitem> |
| 649 | <listitem> |
| 650 | <para>Memory which is owned exclusively by one thread may be read by |
| 651 | that thread and others without locking. However, in this situation |
| 652 | no thread may do unlocked writes to the memory (except for the owner |
| 653 | thread's initializing write).</para> |
| 654 | </listitem> |
| 655 | <listitem> |
| 656 | <para>Memory which is shared between multiple threads, one or more |
| 657 | of which writes to it, must be protected by a lock which is |
| 658 | correctly acquired and released by all threads accessing the |
| 659 | memory.</para> |
| 660 | </listitem> |
| 661 | </itemizedlist> |
| 662 | |
| 663 | <para>Any violation of this discipline will cause an error to be reported. |
| 664 | However, two exemptions apply:</para> |
| 665 | |
| 666 | <itemizedlist> |
| 667 | <listitem> |
| 668 | <para>A thread Y can acquire exclusive ownership of memory |
| 669 | previously owned exclusively by a different thread X providing |
| 670 | X's last access and Y's first access are separated by one of the |
| 671 | following synchronization events:</para> |
| 672 | <itemizedlist> |
| 673 | <listitem><para>X creates thread Y</para></listitem> |
| 674 | <listitem><para>X joins back to Y</para></listitem> |
| 675 | <listitem><para>X uses a condition-variable to signal at Y, and Y is |
| 676 | waiting for that event</para></listitem> |
| 677 | <listitem><para>Y completes a semaphore wait as a result of X signalling |
| 678 | on that same semaphore</para></listitem> |
| 679 | </itemizedlist> |
| 680 | <para> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 681 | This refinement allows Helgrind to correctly track the ownership |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 682 | state of inter-thread buffers used in the worker-thread and |
| 683 | worker-thread-pool concurrent programming idioms (styles).</para> |
| 684 | </listitem> |
| 685 | <listitem> |
| 686 | <para>Similarly, if thread Y joins back to thread X, memory |
| 687 | exclusively owned by Y becomes exclusively owned by X instead. |
| 688 | Also, memory that has been shared only by X and Y becomes |
| 689 | exclusively owned by X. More generally, memory that has been shared |
| 690 | by X, Y and some arbitrary other set S of threads is re-marked as |
| 691 | shared by X and S. Hence, under the right circumstances, memory |
| 692 | shared amongst multiple threads, all of which join into just one, |
| 693 | can revert to the exclusive ownership state.</para> |
| 694 | <para> |
| 695 | In effect, each memory location may make arbitrarily many |
| 696 | transitions between exclusive and shared ownership. Furthermore, a |
| 697 | different lock may protect the location during each period of shared |
| 698 | ownership. This significantly enhances the flexibility of the |
| 699 | algorithm.</para> |
| 700 | </listitem> |
| 701 | </itemizedlist> |
| 702 | |
| 703 | <para>The ownership state, accessing thread-set and related lock-set |
| 704 | for each memory location are tracked at 8-bit granularity. This means |
| 705 | the algorithm is precise even for 16- and 8-bit memory |
| 706 | accesses.</para> |
| 707 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 708 | <para>Helgrind correctly handles reader-writer locks in this |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 709 | framework. Locations shared between multiple threads can be protected |
| 710 | during reads by locks held in either read-mode or write-mode, but can |
| 711 | only be protected during writes by locks held in write-mode. Normal |
| 712 | POSIX mutexes are treated as if they are reader-writer locks which are |
| 713 | only ever held in write-mode.</para> |
| 714 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 715 | <para>Helgrind correctly handles POSIX mutexes for which recursive |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 716 | locking is allowed.</para> |
| 717 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 718 | <para>Helgrind partially correctly handles x86 and amd64 memory access |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 719 | instructions preceded by a LOCK prefix. Writes are correctly handled, |
| 720 | by pretending that the LOCK prefix implies acquisition and release of |
| 721 | a magic "bus hardware lock" mutex before and after the instruction. |
| 722 | This unfortunately requires subsequent reads from such locations to |
| 723 | also use a LOCK prefix, which is not required by the real hardware. |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 724 | Helgrind does not offer any equivalent handling for atomic sequences |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 725 | on PowerPC/POWER platforms created by the use of lwarx/stwcx |
| 726 | instructions.</para> |
| 727 | |
| 728 | </sect2> |
| 729 | |
| 730 | |
| 731 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 732 | <sect2 id="hg-manual.data-races.errmsgs" xreflabel="Race Error Messages"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 733 | <title>Interpreting Race Error Messages</title> |
| 734 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 735 | <para>Helgrind's race detection algorithm collects a lot of |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 736 | information, and tries to present it in a helpful way when a race is |
| 737 | detected. Here's an example:</para> |
| 738 | |
| 739 | <programlisting><![CDATA[ |
| 740 | Thread #2 was created |
| 741 | at 0x510548E: clone (in /lib64/libc-2.5.so) |
| 742 | by 0x4E2F305: do_clone (in /lib64/libpthread-2.5.so) |
| 743 | by 0x4E2F7C5: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.5.so) |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 744 | by 0x4C23870: pthread_create@* (hg_intercepts.c:198) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 745 | by 0x400CEF: main (tc17_sembar.c:195) |
| 746 | |
| 747 | // And the same for threads #3, #4 and #5 -- omitted for conciseness |
| 748 | |
| 749 | Possible data race during read of size 4 at 0x602174 |
| 750 | at 0x400BE5: gomp_barrier_wait (tc17_sembar.c:122) |
| 751 | by 0x400C44: child (tc17_sembar.c:161) |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 752 | by 0x4C25DF7: mythread_wrapper (hg_intercepts.c:178) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 753 | by 0x4E2F09D: start_thread (in /lib64/libpthread-2.5.so) |
| 754 | by 0x51054CC: clone (in /lib64/libc-2.5.so) |
| 755 | Old state: shared-modified by threads #2, #3, #4, #5 |
| 756 | New state: shared-modified by threads #2, #3, #4, #5 |
| 757 | Reason: this thread, #2, holds no consistent locks |
| 758 | Last consistently used lock for 0x602174 was first observed |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 759 | at 0x4C25D01: pthread_mutex_init (hg_intercepts.c:326) |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 760 | by 0x4009E4: gomp_barrier_init (tc17_sembar.c:46) |
| 761 | by 0x400CBC: main (tc17_sembar.c:192) |
| 762 | ]]></programlisting> |
| 763 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 764 | <para>Helgrind first announces the creation points of any threads |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 765 | referenced in the error message. This is so it can speak concisely |
| 766 | about threads and sets of threads without repeatedly printing their |
| 767 | creation point call stacks. Each thread is only ever announced once, |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 768 | the first time it appears in any Helgrind error message.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 769 | |
| 770 | <para>The main error message begins at the text |
| 771 | "<computeroutput>Possible data race during read</computeroutput>". |
| 772 | At the start is information you would expect to see -- address and |
| 773 | size of the racing access, whether a read or a write, and the call |
| 774 | stack at the point it was detected.</para> |
| 775 | |
| 776 | <para>More interesting is the state transition caused by this access. |
| 777 | This memory is already in the shared-modified state, and up to now has |
| 778 | been consistently protected by at least one lock. However, the thread |
| 779 | making the access in question (thread #2, here) does not hold any |
| 780 | locks in common with those held during all previous accesses to the |
| 781 | location -- "no consistent locks", in other words.</para> |
| 782 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 783 | <para>Finally, Helgrind shows the lock which has protected this |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 784 | location in all previous accesses. (If there is more than one, only |
| 785 | one is shown). This can be a useful hint, because it typically shows |
| 786 | the lock that the programmers intended to use to protect the location, |
| 787 | but in this case forgot.</para> |
| 788 | |
| 789 | <para>Here are some more examples of race reports. This is not an |
| 790 | exhaustive list of combinations, but should give you some insight into |
| 791 | how to interpret the output.</para> |
| 792 | |
| 793 | <programlisting><![CDATA[ |
| 794 | Possible data race during write ... |
| 795 | Old state: shared-readonly by threads #1, #2, #3 |
| 796 | New state: shared-modified by threads #1, #2, #3 |
| 797 | Reason: this thread, #3, holds no consistent locks |
| 798 | Location ... has never been protected by any lock |
| 799 | ]]></programlisting> |
| 800 | |
| 801 | <para>The location is shared by 3 threads, all of which have been |
| 802 | reading it without locking ("has never been protected by any lock"). |
| 803 | Now one of them is writing it. Regardless of whether the writer has a |
| 804 | lock or not, this is still an error, because the write races against |
| 805 | the previously observed reads.</para> |
| 806 | |
| 807 | <programlisting><![CDATA[ |
| 808 | Possible data race during read ... |
| 809 | Old state: shared-modified by threads #1, #2, #3 |
| 810 | New state: shared-modified by threads #1, #2, #3 |
| 811 | Reason: this thread, #3, holds no consistent locks |
| 812 | Last consistently used lock for ... was first observed ... |
| 813 | ]]></programlisting> |
| 814 | |
| 815 | <para>The location is shared by 3 threads, all of which have been |
| 816 | reading and writing it while (as required) holding at least one lock |
| 817 | in common. Now it is being read without that lock being held. In the |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 818 | "Last consistently used lock" part, Helgrind offers its best guess as |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 819 | to the identity of the lock that should have been used.</para> |
| 820 | |
| 821 | <programlisting><![CDATA[ |
| 822 | Possible data race during write ... |
| 823 | Old state: owned exclusively by thread #4 |
| 824 | New state: shared-modified by threads #4, #5 |
| 825 | Reason: this thread, #5, holds no locks at all |
| 826 | ]]></programlisting> |
| 827 | |
| 828 | <para>A location that has so far been accessed exclusively by thread |
| 829 | #4 has now been written by thread #5, without use of any lock. This |
| 830 | can be a sign that the programmer did not consider the possibility of |
| 831 | the location being shared between threads, or, alternatively, forgot |
| 832 | to use the appropriate lock.</para> |
| 833 | |
| 834 | <para>Note that thread #4 exclusively owns the location, and so has |
| 835 | the right to access it without holding a lock. However, this message |
| 836 | does not say that thread #4 is not using a lock for this location. |
| 837 | Indeed, it could be using a lock for the location because it intends |
| 838 | to make it available to other threads, one of which is thread #5 -- |
| 839 | and thread #5 has forgotten to use the lock.</para> |
| 840 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 841 | <para>Also, this message implies that Helgrind did not see any |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 842 | synchronisation event between threads #4 and #5 that would have |
| 843 | allowed #5 to acquire exclusive ownership from #4. See |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 844 | <link linkend="hg-manual.data-races.exclusive">above</link> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 845 | for a discussion of transfers of exclusive ownership states between |
| 846 | threads.</para> |
| 847 | |
| 848 | </sect2> |
| 849 | |
| 850 | |
| 851 | </sect1> |
| 852 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 853 | <sect1 id="hg-manual.effective-use" xreflabel="Helgrind Effective Use"> |
| 854 | <title>Hints and Tips for Effective Use of Helgrind</title> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 855 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 856 | <para>Helgrind can be very helpful in finding and resolving |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 857 | threading-related problems. Like all sophisticated tools, it is most |
| 858 | effective when you understand how to play to its strengths.</para> |
| 859 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 860 | <para>Helgrind will be less effective when you merely throw an |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 861 | existing threaded program at it and try to make sense of any reported |
| 862 | errors. It will be more effective if you design threaded programs |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 863 | from the start in a way that helps Helgrind verify correctness. The |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 864 | same is true for finding memory errors with Memcheck, but applies more |
| 865 | here, because thread checking is a harder problem. Consequently it is |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 866 | much easier to write a correct program for which Helgrind falsely |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 867 | reports (threading) errors than it is to write a correct program for |
| 868 | which Memcheck falsely reports (memory) errors.</para> |
| 869 | |
| 870 | <para>With that in mind, here are some tips, listed most important first, |
| 871 | for getting reliable results and avoiding false errors. The first two |
| 872 | are critical. Any violations of them will swamp you with huge numbers |
| 873 | of false data-race errors.</para> |
| 874 | |
| 875 | |
| 876 | <orderedlist> |
| 877 | |
| 878 | <listitem> |
| 879 | <para>Make sure your application, and all the libraries it uses, |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 880 | use the POSIX threading primitives. Helgrind needs to be able to |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 881 | see all events pertaining to thread creation, exit, locking and |
sewardj | 3387889 | 2007-11-17 09:43:25 +0000 | [diff] [blame^] | 882 | other synchronisation events. To do so it intercepts many POSIX |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 883 | pthread_ functions.</para> |
| 884 | |
| 885 | <para>Do not roll your own threading primitives (mutexes, etc) |
| 886 | from combinations of the Linux futex syscall, counters and whatnot. |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 887 | These throw Helgrind's internal what's-going-on models way off |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 888 | course and will give bogus results.</para> |
| 889 | |
| 890 | <para>Also, do not reimplement existing POSIX abstractions using |
| 891 | other POSIX abstractions. For example, don't build your own |
| 892 | semaphore routines or reader-writer locks from POSIX mutexes and |
| 893 | condition variables. Instead use POSIX reader-writer locks and |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 894 | semaphores directly, since Helgrind supports them directly.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 895 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 896 | <para>Helgrind directly supports the following POSIX threading |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 897 | abstractions: mutexes, reader-writer locks, condition variables |
| 898 | (but see below), and semaphores. Currently spinlocks and barriers |
| 899 | are not supported, although they could be in future. A prototype |
| 900 | "safe" implementation of barriers, based on semaphores, is |
| 901 | available: please contact the Valgrind authors for details.</para> |
| 902 | |
| 903 | <para>At the time of writing, the following popular Linux packages |
| 904 | are known to implement their own threading primitives:</para> |
| 905 | |
| 906 | <itemizedlist> |
| 907 | <listitem><para>Qt version 4.X. Qt 3.X is fine, but not 4.X. |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 908 | Helgrind contains partial direct support for Qt 4.X threading, |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 909 | but this is not yet in a usable state. Assistance from folks |
| 910 | knowledgeable in Qt 4 threading internals would be |
| 911 | appreciated.</para></listitem> |
| 912 | |
| 913 | <listitem><para>Runtime support library for GNU OpenMP (part of |
| 914 | GCC), at least GCC versions 4.2 and 4.3. With some minor effort |
| 915 | of modifying the GNU OpenMP runtime support sources, it is |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 916 | possible to use Helgrind on GNU OpenMP compiled codes. Please |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 917 | contact the Valgrind authors for details.</para></listitem> |
| 918 | </itemizedlist> |
| 919 | </listitem> |
| 920 | |
| 921 | <listitem> |
| 922 | <para>Avoid memory recycling. If you can't avoid it, you must |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 923 | tell Helgrind what is going on via the VALGRIND_HG_CLEAN_MEMORY |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 924 | client request |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 925 | (in <computeroutput>helgrind.h</computeroutput>).</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 926 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 927 | <para>Helgrind is aware of standard memory allocation and |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 928 | deallocation that occurs via malloc/free/new/delete and from entry |
| 929 | and exit of stack frames. In particular, when memory is |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 930 | deallocated via free, delete, or function exit, Helgrind considers |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 931 | that memory clean, so when it is eventually reallocated, its |
| 932 | history is irrelevant.</para> |
| 933 | |
| 934 | <para>However, it is common practice to implement memory recycling |
| 935 | schemes. In these, memory to be freed is not handed to |
| 936 | free/delete, but instead put into a pool of free buffers to be |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 937 | handed out again as required. The problem is that Helgrind has no |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 938 | way to know that such memory is logically no longer in use, and |
| 939 | its history is irrelevant. Hence you must make that explicit, |
| 940 | using the VALGRIND_HG_CLEAN_MEMORY client request to specify the |
| 941 | relevant address ranges. It's easiest to put these requests into |
| 942 | the pool manager code, and use them either when memory is returned |
| 943 | to the pool, or is allocated from it.</para> |
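
<para>A minimal sketch of a pool manager that issues the request when
a buffer is returned to the pool. The pool itself
(<computeroutput>pool_get</computeroutput>,
<computeroutput>pool_put</computeroutput>, the buffer sizes) is
invented for illustration and is deliberately single-threaded; a real
pool would guard its free list with a mutex. The include path
<computeroutput>valgrind/helgrind.h</computeroutput> and the no-op
fallback macro are assumptions that let the sketch build on systems
without the Valgrind headers installed.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stddef.h>

#if defined(__has_include)
#  if __has_include(<valgrind/helgrind.h>)
#    include <valgrind/helgrind.h>
#  endif
#endif
#ifndef VALGRIND_HG_CLEAN_MEMORY
   /* Assumed fallback: a no-op when helgrind.h is unavailable. */
#  define VALGRIND_HG_CLEAN_MEMORY(start, len) ((void)(start), (void)(len))
#endif

#define BUF_SIZE 256
#define N_BUFS   4

static char  pool_mem[N_BUFS][BUF_SIZE];
static char* free_list[N_BUFS];
static int   n_free = -1;     /* -1: pool not yet initialised */

/* Hand out a buffer, or NULL if the pool is empty. */
char* pool_get ( void ) {
   if (n_free < 0) {
      for (n_free = 0; n_free < N_BUFS; n_free++)
         free_list[n_free] = pool_mem[n_free];
   }
   if (n_free == 0)
      return NULL;
   return free_list[--n_free];
}

/* Return a buffer to the pool, telling Helgrind that its access
   history is no longer relevant. */
void pool_put ( char* b ) {
   assert(n_free >= 0 && n_free < N_BUFS);
   VALGRIND_HG_CLEAN_MEMORY(b, BUF_SIZE);
   free_list[n_free++] = b;
}
]]></programlisting>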
| 944 | </listitem> |
| 945 | |
| 946 | <listitem> |
| 947 | <para>Avoid POSIX condition variables. If you can, use POSIX |
| 948 | semaphores (sem_t, sem_post, sem_wait) to do inter-thread event |
| 949 | signalling. Semaphores with an initial value of zero are |
| 950 | particularly useful for this.</para> |
| 951 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 952 | <para>Helgrind only partially correctly handles POSIX condition |
| 953 | variables. This is because Helgrind can see inter-thread |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 954 | dependencies between a pthread_cond_wait call and a |
| 955 | pthread_cond_signal/broadcast call only if the waiting thread |
| 956 | actually gets to the rendezvous first (so that it actually calls |
| 957 | pthread_cond_wait). It can't see dependencies between the threads |
| 958 | if the signaller arrives first. In the latter case, POSIX |
| 959 | guidelines imply that the associated boolean condition still |
| 960 | provides an inter-thread synchronisation event, but one which is |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 961 | invisible to Helgrind.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 962 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 963 | <para>The result of Helgrind missing some inter-thread |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 964 | synchronisation events is to cause it to report false positives. |
| 965 | That's because missing such events reduces the extent to which it |
| 966 | can transfer exclusive memory ownership between threads. So |
| 967 | memory may end up in a shared-modified state when that was not |
| 968 | intended by the application programmers.</para> |
| 969 | |
| 970 | <para>The root cause of this synchronisation lossage is |
| 971 | particularly hard to understand, so an example is helpful. It was |
| 972 | discussed at length by Arndt Muehlenfeld ("Runtime Race Detection |
| 973 | in Multi-Threaded Programs", Dissertation, TU Graz, Austria). The |
| 974 | canonical POSIX-recommended usage scheme for condition variables |
| 975 | is as follows:</para> |
| 976 | |
| 977 | <programlisting><![CDATA[ |
| 978 | b is a Boolean condition, which is False most of the time |
| 979 | cv is a condition variable |
| 980 | mx is its associated mutex |
| 981 | |
| 982 | Signaller: Waiter: |
| 983 | |
| 984 | lock(mx) lock(mx) |
| 985 | b = True while (b == False) |
| 986 | signal(cv) wait(cv,mx) |
| 987 | unlock(mx) unlock(mx) |
| 988 | ]]></programlisting> |
| 989 | |
| 990 | <para>Assume <computeroutput>b</computeroutput> is False most of |
| 991 | the time. If the waiter arrives at the rendezvous first, it |
| 992 | enters its while-loop, waits for the signaller to signal, and |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 993 | eventually proceeds. Helgrind sees the signal, notes the |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 994 | dependency, and all is well.</para> |
| 995 | |
| 996 | <para>If the signaller arrives |
| 997 | first, <computeroutput>b</computeroutput> is set to True, and the |
| 998 | signal disappears into nowhere. When the waiter later arrives, it |
| 999 | does not enter its while-loop and simply carries on. But even in |
| 1000 | this case, the waiter code following the while-loop cannot execute |
| 1001 | until the signaller sets <computeroutput>b</computeroutput> to |
| 1002 | True. Hence there is still the same inter-thread dependency, but |
| 1003 | this time it is through an arbitrary in-memory condition, and |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1004 | Helgrind cannot see it.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1005 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1006 | <para>By comparison, Helgrind's detection of inter-thread |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1007 | dependencies caused by semaphore operations is believed to be |
| 1008 | exactly correct.</para> |
| 1009 | |
| 1010 | <para>As far as I know, a solution to this problem that does not |
| 1011 | require source-level annotation of condition-variable wait loops |
| 1012 | is beyond the current state of the art.</para> |
| 1013 | </listitem> |
| 1014 | |
| 1015 | <listitem> |
| 1016 | <para>Make sure you are using a supported platform. At |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1017 | present, Helgrind only properly supports x86-linux and amd64-linux |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1018 | with glibc-2.3 or later. The latter restriction means we only |
| 1019 | support glibc's NPTL threading implementation. The old |
| 1020 | LinuxThreads implementation is not supported.</para> |
| 1021 | |
| 1022 | <para>Unsupported targets may work to varying degrees. In |
| 1023 | particular ppc32-linux and ppc64-linux running NPTL should work, |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1024 | but you will get false race errors because Helgrind does not know |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1025 | how to properly handle atomic instruction sequences created using |
| 1026 | the lwarx/stwcx instructions.</para> |
| 1027 | </listitem> |
| 1028 | |
| 1029 | <listitem> |
| 1030 | <para>Round up all finished threads using pthread_join. Avoid |
| 1031 | detaching threads: don't create threads in the detached state, and |
| 1032 | don't call pthread_detach on existing threads.</para> |
| 1033 | |
| 1034 | <para>Using pthread_join to round up finished threads provides a |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1035 | clear synchronisation point that both Helgrind and programmers can |
| 1036 | see. This synchronisation point allows Helgrind to adjust its |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1037 | memory ownership |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1038 | models <link linkend="hg-manual.data-races.exclusive">as described |
| 1039 | extensively above</link>, which helps Helgrind produce more |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1040 | accurate error reports.</para> |
| 1041 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1042 | <para>If you don't call pthread_join on a thread, Helgrind has no |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1043 | way to know when it finishes, relative to any significant |
| 1044 | synchronisation points for other threads in the program. So it |
| 1045 | assumes that the thread lingers indefinitely and can potentially |
| 1046 | interfere indefinitely with the memory state of the program. It |
| 1047 | has every right to assume that -- after all, it might really be |
| 1048 | the case that, for scheduling reasons, the exiting thread did run |
| 1049 | very slowly in the last stages of its life.</para> |
| 1050 | </listitem> |
| 1051 | |
| 1052 | <listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1053 | <para>Perform thread debugging (with Helgrind) and memory |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1054 | debugging (with Memcheck) together.</para> |
| 1055 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1056 | <para>Helgrind tracks the state of memory in detail, and memory |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1057 | management bugs in the application are liable to cause confusion. |
| 1058 | In extreme cases, applications which do many invalid reads and |
| 1059 | writes (particularly to freed memory) have been known to crash |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1060 | Helgrind. So, ideally, you should make your application |
| 1061 | Memcheck-clean before using Helgrind.</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1062 | |
| 1063 | <para>It may be impossible to make your application Memcheck-clean |
| 1064 | unless you first remove threading bugs. In particular, it may be |
| 1065 | difficult to remove all reads and writes to freed memory in |
| 1066 | multithreaded C++ destructor sequences at program termination. |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1067 | So, ideally, you should make your application Helgrind-clean |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1068 | before using Memcheck.</para> |
| 1069 | |
| 1070 | <para>Since this circularity is obviously unresolvable, at least |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1071 | bear in mind that Memcheck and Helgrind are to some extent |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1072 | complementary, and you may need to use them together.</para> |
| 1073 | </listitem> |
| 1074 | |
| 1075 | <listitem> |
| 1076 | <para>POSIX requires that implementations of standard I/O (printf, |
| 1077 | fprintf, fwrite, fread, etc) are thread safe. Unfortunately GNU |
| 1078 | libc implements this by using internal locking primitives that |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1079 | Helgrind is unable to intercept. Consequently Helgrind generates |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1080 | many false race reports when you use these functions.</para> |
| 1081 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1082 | <para>Helgrind attempts to hide these errors using the standard |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1083 | Valgrind error-suppression mechanism. So, at least for simple |
| 1084 | test cases, you don't see any. Nevertheless, some may slip |
| 1085 | through, so bear this in mind when reading race reports.</para> |
| 1086 | </listitem> |
| 1087 | |
| 1088 | <listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1089 | <para>Helgrind's error checks do not work properly inside the |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1090 | system threading library itself |
| 1091 | (<computeroutput>libpthread.so</computeroutput>), and it usually |
| 1092 | observes large numbers of (false) errors in there. Valgrind's |
| 1093 | suppression system then filters these out, so you should not see |
| 1094 | them.</para> |
| 1095 | |
| 1096 | <para>If you see any race errors reported |
| 1097 | where <computeroutput>libpthread.so</computeroutput> or |
| 1098 | <computeroutput>ld.so</computeroutput> is the object associated |
| 1099 | with the innermost stack frame, please file a bug report at |
| 1100 | http://www.valgrind.org.</para> |
| 1101 | </listitem> |
| 1102 | |
| 1103 | </orderedlist> |
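<para>One possible way to follow the advice above about using Memcheck
and Helgrind together is to alternate between the two tools until both
runs are clean. A minimal sketch, assuming a placeholder program
name <computeroutput>./my_app</computeroutput>:</para>
<programlisting><![CDATA[
# First pass: look for invalid reads/writes and other memory errors.
valgrind --tool=memcheck ./my_app

# Second pass: look for races and other threading errors.
valgrind --tool=helgrind ./my_app

# Repeat both passes after each fix, since fixing one kind of bug
# may change what the other tool reports.
]]></programlisting>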
| 1104 | |
| 1105 | </sect1> |
| 1106 | |
| 1107 | |
| 1108 | |
| 1109 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1110 | <sect1 id="hg-manual.options" xreflabel="Helgrind Options"> |
| 1111 | <title>Helgrind Options</title> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1112 | |
| 1113 | <para>The following end-user options are available:</para> |
| 1114 | |
| 1115 | <!-- start of xi:include in the manpage --> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1116 | <variablelist id="hg.opts.list"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1117 | |
| 1118 | <varlistentry id="opt.happens-before" xreflabel="--happens-before"> |
| 1119 | <term> |
| 1120 | <option><![CDATA[--happens-before=none|threads|all |
| 1121 | [default: all] ]]></option> |
| 1122 | </term> |
| 1123 | <listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1124 | <para>Helgrind always regards locks as the basis for |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1125 | inter-thread synchronisation. However, by default, before |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1126 | reporting a race error, Helgrind will also check whether |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1127 | certain other kinds of inter-thread synchronisation events |
| 1128 | happened. It may be that if such events took place, then no |
| 1129 | race really occurred, and so no error needs to be reported. |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1130 | See <link linkend="hg-manual.data-races.exclusive">above</link> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1131 | for a discussion of transfers of exclusive ownership states |
| 1132 | between threads. |
| 1133 | </para> |
| 1134 | <para>With <varname>--happens-before=all</varname>, the |
| 1135 | following events are regarded as sources of synchronisation: |
| 1136 | thread creation/joinage, condition variable |
| 1137 | signal/broadcast/waits, and semaphore posts/waits. |
| 1138 | </para> |
| 1139 | <para>With <varname>--happens-before=threads</varname>, only |
| 1140 | thread creation/joinage events are regarded as sources of |
| 1141 | synchronisation. |
| 1142 | </para> |
| 1143 | <para>With <varname>--happens-before=none</varname>, no events |
| 1144 | (apart, of course, from locking) are regarded as sources of |
| 1145 | synchronisation. |
| 1146 | </para> |
| 1147 | <para>Changing this setting from the default will increase your |
| 1148 | false-error rate but give little or no gain. The only advantage |
| 1149 | is that <option>--happens-before=threads</option> and |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1150 | <option>--happens-before=none</option> should make Helgrind |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1151 | less and less sensitive to the scheduling of threads, and hence |
| 1152 | the output more and more repeatable across runs. |
| 1153 | </para> |
| 1154 | </listitem> |
| 1155 | </varlistentry> |
| 1156 | |
| 1157 | <varlistentry id="opt.trace-addr" xreflabel="--trace-addr"> |
| 1158 | <term> |
| 1159 | <option><![CDATA[--trace-addr=0xXXYYZZ |
| 1160 | ]]></option> and |
| 1161 | <option><![CDATA[--trace-level=0|1|2 [default: 1] |
| 1162 | ]]></option> |
| 1163 | </term> |
| 1164 | <listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1165 | <para>Requests that Helgrind produces a log of all state changes |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1166 | to location 0xXXYYZZ. This can be helpful in tracking down |
| 1167 | tricky races. <varname>--trace-level</varname> controls the |
| 1168 | verbosity of the log. At the default setting (1), a one-line |
| 1169 | summary is printed for each state change. At level 2 a |
| 1170 | complete stack trace is printed for each state change.</para> |
| 1171 | </listitem> |
| 1172 | </varlistentry> |
| 1173 | |
| 1174 | </variablelist> |
| 1175 | <!-- end of xi:include in the manpage --> |
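<para>As a rough sketch of how the end-user options above might appear
on a command line: the program
name <computeroutput>./my_app</computeroutput> and the
address <computeroutput>0x8049A30</computeroutput> are placeholders, not
values taken from a real run.</para>
<programlisting><![CDATA[
# Regard only thread creation/joinage as synchronisation events. This
# is likely to produce more false errors, but the output should vary
# less from run to run.
valgrind --tool=helgrind --happens-before=threads ./my_app

# Log every state change of a single location, with a complete stack
# trace for each change. Substitute the address you are interested in.
valgrind --tool=helgrind --trace-addr=0x8049A30 --trace-level=2 ./my_app
]]></programlisting>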
| 1176 | |
| 1177 | <!-- start of xi:include in the manpage --> |
| 1178 | <para>In addition, the following debugging options are available for |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1179 | Helgrind:</para> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1180 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1181 | <variablelist id="hg.debugopts.list"> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1182 | |
| 1183 | <varlistentry id="opt.trace-malloc" xreflabel="--trace-malloc"> |
| 1184 | <term> |
| 1185 | <option><![CDATA[--trace-malloc=no|yes [no] |
| 1186 | ]]></option> |
| 1187 | </term> |
| 1188 | <listitem> |
| 1189 | <para>Show all client malloc (etc) and free (etc) requests.</para> |
| 1190 | </listitem> |
| 1191 | </varlistentry> |
| 1192 | |
| 1193 | <varlistentry id="opt.gen-vcg" xreflabel="--gen-vcg"> |
| 1194 | <term> |
| 1195 | <option><![CDATA[--gen-vcg=no|yes|yes-w-vts [no] |
| 1196 | ]]></option> |
| 1197 | </term> |
| 1198 | <listitem> |
| 1199 | <para>At exit, write to stderr a dump of the happens-before |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1200 | graph computed by Helgrind, in a format suitable for the VCG |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1201 | graph visualisation tool. A suitable command line is:</para> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1202 | <para><computeroutput>valgrind --tool=helgrind |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1203 | --gen-vcg=yes my_app 2>&amp;1 |
| 1204 | | grep xxxxxx | sed "s/xxxxxx//g" |
| 1205 | | xvcg -</computeroutput></para> |
| 1206 | <para>With <varname>--gen-vcg=yes</varname>, the basic |
| 1207 | happens-before graph is shown. With |
| 1208 | <varname>--gen-vcg=yes-w-vts</varname>, the vector timestamp |
| 1209 | for each node is also shown.</para> |
| 1210 | </listitem> |
| 1211 | </varlistentry> |
| 1212 | |
| 1213 | <varlistentry id="opt.cmp-race-err-addrs" |
| 1214 | xreflabel="--cmp-race-err-addrs"> |
| 1215 | <term> |
| 1216 | <option><![CDATA[--cmp-race-err-addrs=no|yes [no] |
| 1217 | ]]></option> |
| 1218 | </term> |
| 1219 | <listitem> |
| 1220 | <para>Controls whether or not race (data) addresses should be |
| 1221 | taken into account when removing duplicates of race errors. |
| 1222 | With <varname>--cmp-race-err-addrs=no</varname>, two otherwise |
| 1223 | identical race errors will be considered to be the same even if |
| 1224 | their race addresses differ. With |
| 1225 | <varname>--cmp-race-err-addrs=yes</varname> they will be |
| 1226 | considered different. This is provided to help make certain |
| 1227 | regression tests work reliably.</para> |
| 1228 | </listitem> |
| 1229 | </varlistentry> |
| 1230 | |
| 1231 | <varlistentry id="opt.tc-sanity-flags" xreflabel="--tc-sanity-flags"> |
| 1232 | <term> |
| 1233 | <option><![CDATA[--tc-sanity-flags=<XXXXX> (X = 0|1) [00000] |
| 1234 | ]]></option> |
| 1235 | </term> |
| 1236 | <listitem> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1237 | <para>Run extensive sanity checks on Helgrind's internal |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1238 | data structures at events defined by the bitstring, as |
| 1239 | follows:</para> |
| 1240 | <para><computeroutput>10000 </computeroutput>after changes to |
| 1241 | the lock order acquisition graph</para> |
| 1242 | <para><computeroutput>01000 </computeroutput>after every client |
| 1243 | memory access (NB: not currently used)</para> |
| 1244 | <para><computeroutput>00100 </computeroutput>after every client |
| 1245 | memory range permission setting of 256 bytes or greater</para> |
| 1246 | <para><computeroutput>00010 </computeroutput>after every client |
| 1247 | lock or unlock event</para> |
| 1248 | <para><computeroutput>00001 </computeroutput>after every client |
| 1249 | thread creation or joinage event</para> |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1250 | <para>Note these will make Helgrind run very slowly, often to |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1251 | the point of being completely unusable.</para> |
| 1252 | </listitem> |
| 1253 | </varlistentry> |
| 1254 | |
| 1255 | </variablelist> |
| 1256 | <!-- end of xi:include in the manpage --> |
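<para>For illustration only, a regression-test run might enable a
single class of sanity check, or address-sensitive duplicate removal,
as sketched below. The program
name <computeroutput>./my_test</computeroutput> is a placeholder, and
the choice of which bit to set is arbitrary.</para>
<programlisting><![CDATA[
# 00010: sanity-check Helgrind's internal data structures after every
# client lock or unlock event. Expect a very large slowdown.
valgrind --tool=helgrind --tc-sanity-flags=00010 ./my_test

# Treat otherwise identical race errors at different addresses as
# distinct errors when removing duplicates.
valgrind --tool=helgrind --cmp-race-err-addrs=yes ./my_test
]]></programlisting>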
| 1257 | |
| 1258 | |
| 1259 | </sect1> |
| 1260 | |
sewardj | 572feb7 | 2007-11-09 23:59:49 +0000 | [diff] [blame] | 1261 | <sect1 id="hg-manual.todolist" xreflabel="To Do List"> |
| 1262 | <title>A To-Do List for Helgrind</title> |
sewardj | b411202 | 2007-11-09 22:49:28 +0000 | [diff] [blame] | 1263 | |
| 1264 | <para>The following is a list of loose ends which should be tidied up |
| 1265 | some time.</para> |
| 1266 | |
| 1267 | <itemizedlist> |
| 1268 | <listitem><para>Track which mutexes are associated with which |
| 1269 | condition variables, and emit a warning if this becomes |
| 1270 | inconsistent.</para> |
| 1271 | </listitem> |
| 1272 | <listitem><para>For lock order errors, print the complete lock |
| 1273 | cycle, rather than only doing so for size-2 cycles as at |
| 1274 | present.</para> |
| 1275 | </listitem> |
| 1276 | <listitem><para>Document the VALGRIND_HG_CLEAN_MEMORY client |
| 1277 | request.</para> |
| 1278 | </listitem> |
| 1279 | <listitem><para>Possibly a client request to forcibly transfer |
| 1280 | ownership of memory from one thread to another. Requires further |
| 1281 | consideration.</para> |
| 1282 | </listitem> |
| 1283 | <listitem><para>Add a new client request that marks an address range |
| 1284 | as being "shared-modified with empty lockset" (the error state), |
| 1285 | and describe how to use it.</para> |
| 1286 | </listitem> |
| 1287 | <listitem><para>Document races caused by gcc's thread-unsafe code |
| 1288 | generation for speculative stores. In the interim see |
| 1289 | <computeroutput>http://gcc.gnu.org/ml/gcc/2007-10/msg00266.html |
| 1290 | </computeroutput> |
| 1291 | and <computeroutput>http://lkml.org/lkml/2007/10/24/673</computeroutput>. |
| 1292 | </para> |
| 1293 | </listitem> |
| 1294 | <listitem><para>Don't update the lock-order graph, and don't check |
| 1295 | for errors, when a "try"-style lock operation happens (eg |
| 1296 | pthread_mutex_trylock). Such calls do not add any real |
| 1297 | restrictions to the locking order, since they can always fail to |
| 1298 | acquire the lock, resulting in the caller going off and doing Plan |
| 1299 | B (presumably it will have a Plan B). Doing such checks could |
| 1300 | generate false lock-order errors and confuse users.</para> |
| 1301 | </listitem> |
| 1302 | <listitem><para> Performance can be very poor. Slowdowns on the |
| 1303 | order of 100:1 are not unusual. There is quite some scope for |
| 1304 | performance improvements, though. |
| 1305 | </para> |
| 1306 | </listitem> |
| 1307 | |
| 1308 | </itemizedlist> |
| 1309 | |
| 1310 | </sect1> |
| 1311 | |
| 1312 | </chapter> |