page.title=SMP Primer for Android
page.tags=ndk,native

page.article=true
@jd:body

<div id="tb-wrapper">
<div id="tb">
<h2>In this document</h2>
<ol class="nolist">
<li><a href="#theory">Theory</a>
<ol class="nolist">
<li style="margin:3px 0 0"><a href="#mem_consistency">Memory consistency models</a>
</li>
<li style="margin:3px 0 0"><a href="#racefree">Data-race-free programming</a>
<ol class="nolist">
<li style="margin:0"><a href="#dataraces">What's a "data race"?</a></li>
<li style="margin:0"><a href="#avoiding">Avoiding data races</a></li>
<li style="margin:0"><a href="#reordering">When memory reordering becomes visible</a></li>
</ol>
</li>
</ol>
</li>
<li><a href="#practice">Practice</a>
<ol class="nolist">
<li style="margin:3px 0 0"><a href="#c_dont">What not to do in C</a>
<ol class="nolist">
<li style="margin:0"><a href="#volatile">C/C++ and “volatile”</a></li>
<li style="margin:0"><a href="#examplesc">Examples</a></li>
</ol>
</li>
<li style="margin:3px 0 0"><a href="#j_dont">What not to do in Java</a>
<ol class="nolist">
<li style="margin:0"><a href="#sync_volatile">“synchronized” and “volatile”</a></li>
<li style="margin:0"><a href="#examplesj">Examples</a></li>
</ol>
</li>
<li style="margin:3px 0 0"><a href="#bestpractice">What to do</a>
</li>
</ol>
</li>
<li><a href="#weak">A little more about weak memory orders</a>
<ol class="nolist">
<li style="margin:0"><a href="#nonracing">Non-racing accesses</a></li>
<li style="margin:0"><a href="#hint_only">Result is not relied upon for correctness</a></li>
<li style="margin:0"><a href="#unread">Atomically modified but unread data</a></li>
<li style="margin:0"><a href="#flag">Simple flag communication</a></li>
<li style="margin:0"><a href="#immutable">Immutable fields</a></li>
</ol>
</li>
<li><a href="#closing_notes">Closing Notes</a></li>
<li><a href="#appendix">Appendix</a>
<ol class="nolist">
<li style="margin:0"><a href="#sync_stores">Implementing synchronization stores</a></li>
<li style="margin:0"><a href="#more">Further reading</a></li>
</ol>
</li>
</ol>
</div>
</div>

<p>Android 3.0 and later platform versions are optimized to support
multiprocessor architectures. This document introduces issues that
can arise when writing multithreaded code for symmetric multiprocessor systems
in C, C++, and the Java programming language (hereafter referred to simply as
“Java” for the sake of brevity). It's intended as a primer for Android app
developers, not as a complete discussion of the subject.</p>

<h2 id="intro">Introduction</h2>

<p>SMP is an acronym for “Symmetric Multi-Processor”. It describes a design in
which two or more identical CPU cores share access to main memory. Until
a few years ago, all Android devices were effectively UP (Uni-Processor).</p>

<p>Most — if not all — Android devices always had multiple CPUs, but in the
past only one of them was used to run applications, while the others managed
various bits of device hardware (for example, the radio). The CPUs may have had
different architectures, and the programs running on them couldn’t use main
memory to communicate with each other.</p>

<p>Most Android devices sold today are built around SMP designs,
making things a bit more complicated for software developers. Race conditions
in a multi-threaded program may not cause visible problems on a uniprocessor,
but they may fail regularly when two or more of your threads
are running simultaneously on different cores.
What’s more, code may be more or less prone to failures when run on different
processor architectures, or even on different implementations of the same
architecture. Code that has been thoroughly tested on x86 may break badly on
ARM. Code may start to fail when recompiled with a more modern compiler.</p>

<p>The rest of this document explains why, and tells you what you need to do
to ensure that your code behaves correctly.</p>


<h2 id="theory">Memory consistency models: Why SMPs are a bit different</h2>

<p>This is a high-speed, glossy overview of a complex subject. Some areas will
be incomplete, but none of it should be misleading or wrong. As you
will see in the next section, the details here are usually not important.</p>

<p>See <a href="#more">Further reading</a> at the end of the document for
pointers to more thorough treatments of the subject.</p>

<p>Memory consistency models, or often just “memory models”, describe the
guarantees the programming language or hardware architecture
makes about memory accesses. For example,
if you write a value to address A, and then write a value to address B, the
model might guarantee that every CPU core sees those writes happen in that
order.</p>

<p>The model most programmers are accustomed to is <em>sequential
consistency</em>, which is described like this <span
style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Adve &amp;
Gharachorloo</a>)</span>:</p>

<ul>
<li>All memory operations appear to execute one at a time.</li>
<li>All operations in a single thread appear to execute in the order described
by that processor's program.</li>
</ul>

<p>Let's assume temporarily that we have a very simple compiler or interpreter
that introduces no surprises: it translates
assignments in the source code to load and store instructions in exactly the
corresponding order, one instruction per access. We'll also assume for
simplicity that each thread executes on its own processor.</p>

<p>If you look at a bit of code and see that it does some reads and writes from
memory, on a sequentially-consistent CPU architecture you know that the code
will do those reads and writes in the expected order. It’s possible that the
CPU is actually reordering instructions and delaying reads and writes, but there
is no way for code running on the device to tell that the CPU is doing anything
other than executing instructions in a straightforward manner. (We’ll ignore
memory-mapped device driver I/O.)</p>

<p>To illustrate these points it’s useful to consider small snippets of code,
commonly referred to as <em>litmus tests</em>.</p>

<p>Here’s a simple example, with code running on two threads:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>A = 3<br />
B = 5</code></td>
<td><code>reg0 = B<br />
reg1 = A</code></td>
</tr>
</table>

<p>In this and all future litmus examples, memory locations are represented by
capital letters (A, B, C) and CPU registers start with “reg”. All memory is
initially zero. Instructions are executed from top to bottom. Here, thread 1
stores the value 3 at location A, and then the value 5 at location B. Thread 2
loads the value from location B into reg0, and then loads the value from
location A into reg1. (Note that we’re writing in one order and reading in
another.)</p>

<p>Thread 1 and thread 2 are assumed to execute on different CPU cores. You
should <strong>always</strong> make this assumption when thinking about
multi-threaded code.</p>

<p>Sequential consistency guarantees that, after both threads have finished
executing, the registers will be in one of the following states:</p>

<table>
<tr>
<th>Registers</th>
<th>States</th>
</tr>
<tr>
<td>reg0=5, reg1=3</td>
<td>possible (thread 1 ran first)</td>
</tr>
<tr>
<td>reg0=0, reg1=0</td>
<td>possible (thread 2 ran first)</td>
</tr>
<tr>
<td>reg0=0, reg1=3</td>
<td>possible (concurrent execution)</td>
</tr>
<tr>
<td>reg0=5, reg1=0</td>
<td>never</td>
</tr>
</table>

<p>To get into a situation where we see B=5 before we see the store to A, either
the reads or the writes would have to happen out of order. On a
sequentially-consistent machine, that can’t happen.</p>

<p>Uniprocessors, including x86 and ARM, are normally sequentially consistent.
Threads appear to execute in interleaved fashion, as the OS kernel switches
between them. Most SMP systems, including x86 and ARM,
are not sequentially consistent. For example, it is common for
hardware to buffer stores on their way to memory, so that they
don't immediately reach memory and become visible to other cores.</p>

<p>The details vary substantially. For example, x86, though not sequentially
consistent, still guarantees that reg0 = 5 and reg1 = 0 remains impossible.
Stores are buffered, but their order is maintained.
ARM, on the other hand, does not. The order of buffered stores is not
maintained, and stores may not reach all other cores at the same time.
These differences are important to assembly programmers.
However, as we will see below, C, C++, and Java programmers can
and should program in a way that hides such architectural differences.</p>

<p>So far, we've unrealistically assumed that it is only the hardware that
reorders instructions. In reality, the compiler also reorders instructions to
improve performance. In our example, the compiler might decide that some later
code in Thread 2 needed the value of reg1 before it needed reg0, and thus load
reg1 first. Or some prior code may already have loaded A, and the compiler
might decide to reuse that value instead of loading A again. In either case,
the loads to reg0 and reg1 might be reordered.</p>

<p>Reordering accesses to different memory locations,
either in the hardware or in the compiler, is
allowed, since it doesn't affect the execution of a single thread, and
it can significantly improve performance. As we will see, with a bit of care,
we can also prevent it from affecting the results of multithreaded programs.</p>

<p>Since compilers can also reorder memory accesses, this problem is actually
not new to SMPs. Even on a uniprocessor, a compiler could reorder the loads to
reg0 and reg1 in our example, and Thread 1 could be scheduled between the
reordered instructions. But if our compiler happened to not reorder, we might
never observe this problem. On most ARM SMPs, even without compiler
reordering, the reordering will probably be seen, possibly after a very large
number of successful executions. Unless you're programming in assembly
language, SMPs generally just make it more likely you'll see problems that were
there all along.</p>

<h2 id="racefree">Data-race-free programming</h2>

<p>Fortunately, there is usually an easy way to avoid thinking about any of
these details. If you follow some straightforward rules, it's usually safe
to forget all of the preceding section except the "sequential consistency" part.
Unfortunately, the other complications may become visible if you
accidentally violate those rules.</p>

<p>Modern programming languages encourage what's known as a "data-race-free"
programming style. So long as you promise not to introduce "data races",
and avoid a handful of constructs that tell the compiler otherwise, the compiler
and hardware promise to provide sequentially consistent results. This doesn't
really mean they avoid memory access reordering. It does mean that if you
follow the rules you won't be able to tell that memory accesses are being
reordered. It's a lot like telling you that sausage is a delicious and
appetizing food, so long as you promise not to visit the
sausage factory. Data races are what expose the ugly truth about memory
reordering.</p>

<h3 id="dataraces">What's a "data race"?</h3>

<p>A <i>data race</i> occurs when at least two threads simultaneously access
the same ordinary data, and at least one of them modifies it. By "ordinary
data" we mean something that's not specifically a synchronization object
intended for thread communication. Mutexes, condition variables, Java
volatiles, and C++ atomic objects are not ordinary data, and their accesses
are allowed to race. In fact they are used to prevent data races on other
objects.</p>

<p>In order to determine whether two threads simultaneously access the same
memory location, we can ignore the memory-reordering discussion from above, and
assume sequential consistency. The following program doesn't have a data race
if <code>A</code> and <code>B</code> are ordinary boolean variables that are
initially false:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>if (A) B = true</code></td>
<td><code>if (B) A = true</code></td>
</tr>
</table>

<p>Since operations are not reordered, both conditions will evaluate to false, and
neither variable is ever updated. Thus there cannot be a data race. There is
no need to think about what might happen if the load from <code>A</code>
and the store to <code>B</code> in
Thread 1 were somehow reordered. The compiler is not allowed to reorder Thread
1 by rewriting it as "<code>B = true; if (!A) B = false</code>". That would be
like making sausage in the middle of town in broad daylight.</p>

<p>Data races are officially defined on basic built-in types like integers and
references or pointers. Assigning to an <code>int</code> while simultaneously
reading it in another thread is clearly a data race. But both the C++
standard library and
the Java Collections libraries are written to allow you to also reason about
data races at the library level. They promise to not introduce data races
unless there are concurrent accesses to the same container, at least one of
which updates it. Updating a <code>set&lt;T&gt;</code> in one thread while
simultaneously reading it in another allows the library to introduce a
data race, and can thus be thought of informally as a "library-level data race".
Conversely, updating one <code>set&lt;T&gt;</code> in one thread, while reading
a different one in another, does not result in a data race, because the
library promises not to introduce a (low-level) data race in that case.</p>

<p>Normally, concurrent accesses to different fields in a data structure
cannot introduce a data race. However, there is one important exception to
this rule: contiguous sequences of bit-fields in C or C++ are treated as
a single "memory location". Accessing any bit-field in such a sequence
is treated as accessing all of them for purposes of determining the
existence of a data race. This reflects the inability of common hardware
to update individual bits without also reading and re-writing adjacent bits.
Java programmers have no analogous concerns.</p>

<h3 id="avoiding">Avoiding data races</h3>

<p>Modern programming languages provide a number of synchronization
mechanisms to avoid data races. The most basic tools are:</p>

<dl>
<dt>Locks or mutexes</dt>

<dd>Mutexes (C++11 <code>std::mutex</code>, or <code>pthread_mutex_t</code>) or
<code>synchronized</code> blocks in Java can be used to ensure that certain
sections of code do not run concurrently with other sections of code accessing
the same data. We'll refer to these and other similar facilities generically
as "locks." Consistently acquiring a specific lock before accessing a shared
data structure and releasing it afterwards prevents data races when accessing
the data structure. It also ensures that updates and accesses are atomic, i.e. no
other update to the data structure can run in the middle. This is deservedly
by far the most common tool for preventing data races. The use of Java
<code>synchronized</code> blocks or C++ <code>lock_guard</code>
or <code>unique_lock</code> ensures that locks are properly released in the
event of an exception.
</dd>

<dt>Volatile/atomic variables</dt>

<dd>Java provides <code>volatile</code> fields that support concurrent access
without introducing data races. Since 2011, C and C++ have supported
<code>atomic</code> variables and fields with similar semantics. These are
typically more difficult to use than locks, since they only ensure that
individual accesses to a single variable are atomic. (In C++ this normally
extends to simple read-modify-write operations, like increments. Java
requires special method calls for that.)
Unlike locks, <code>volatile</code> or <code>atomic</code> variables can't
be used directly to prevent other threads from interfering with longer code sequences.</dd>

</dl>

<p>It's important to note that <code>volatile</code> has very different
meanings in C++ and Java. In C++, <code>volatile</code> does not prevent data
races, though older code often uses it as a workaround for the lack of
<code>atomic</code> objects. This is no longer recommended; in
C++, use <code>atomic&lt;T&gt;</code> for variables that can be concurrently
accessed by multiple threads. C++ <code>volatile</code> is meant for
device registers and the like.</p>

<p>C/C++ <code>atomic</code> variables or Java <code>volatile</code> variables
can be used to prevent data races on other variables. If <code>flag</code> is
declared to have type <code>atomic&lt;bool&gt;</code>
or <code>atomic_bool</code> (C/C++) or <code>volatile boolean</code> (Java),
and is initially false, then the following snippet is data-race-free:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>A = ...<br />
flag = true</code>
</td>
<td><code>while (!flag) {}<br />
... = A</code>
</td>
</tr>
</table>

<p>Since Thread 2 waits for <code>flag</code> to be set, the access to
<code>A</code> in Thread 2 must happen after, and not concurrently with, the
assignment to <code>A</code> in Thread 1. Thus there is no data race on
<code>A</code>. The race on <code>flag</code> doesn't count as a data race,
since volatile/atomic accesses are not "ordinary memory accesses".</p>
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 384 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 385 | <p>The implementation is required to prevent or hide memory reordering |
| 386 | sufficiently to make code like the preceding litmus test behave as expected. |
| 387 | This normally makes volatile/atomic memory accesses |
| 388 | substantially more expensive than ordinary accesses.</p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 389 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 390 | <p>Although the preceding example is data-race-free, locks together with |
| 391 | <code>Object.wait()</code> in Java or condition variables in C/C++ usually |
| 392 | provide a better solution that does not involve waiting in a loop while |
| 393 | draining battery power.</p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 394 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 395 | <h3 id="reordering">When memory reordering becomes visible</h3> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 396 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 397 | Data-race-free programming normally saves us from having to explicitly deal |
| 398 | with memory access reordering issues. However, there are several cases in |
| 399 | which reordering does become visible: |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 400 | |
| 401 | <ol> |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 402 | <li> If your program has a bug resulting in an unintentional data race, |
| 403 | compiler and hardware transformations can become visible, and the behavior |
| 404 | of your program may be surprising. For example, if we forgot to declare |
| 405 | <code>flag</code> volatile in the preceding example, Thread 2 may see an |
| 406 | uninitialized <code>A</code>. Or the compiler may decide that flag can't |
| 407 | possibly change during Thread 2's loop and transform the program to |
| 408 | |
| 409 | <table> |
| 410 | <tr> |
| 411 | <th>Thread 1</th> |
| 412 | <th>Thread 2</th> |
| 413 | </tr> |
| 414 | <tr> |
| 415 | <td><code>A = ...<br /> |
| 416 | flag = true</code> |
| 417 | </td> |
| 418 | <td>reg0 = flag; |
| 419 | while (!reg0) {}<br /> |
| 420 | ... = A |
| 421 | </td> |
| 422 | </tr> |
| 423 | </table> |
| 424 | |
| 425 | When you debug, you may well see the loop continuing forever in spite of |
| 426 | the fact that <code>flag</code> is true.</li> |
| 427 | |
| 428 | <li> C++ provides facilities for explicitly relaxing |
| 429 | sequential consistency even if there are no races. Atomic operations |
| 430 | can take explicit <code>memory_order_</code>... arguments. Similarly, the |
| 431 | <code>java.util.concurrent.atomic</code> package provides a more restricted |
| 432 | set of similar facilities, notably <code>lazySet()</code>. And Java |
| 433 | programmers occasionally use intentional data races for similar effect. |
| 434 | All of these provide performance improvements at a large |
| 435 | cost in programming complexity. We discuss them only briefly |
| 436 | <a href="#weak">below</a>.</li> |
| 437 | |
| 438 | <li> Some C and C++ code is written in an older style, not entirely |
| 439 | consistent with current language standards, in which <code>volatile</code> |
| 440 | variables are used instead of <code>atomic</code> ones, and memory ordering |
| 441 | is explicitly disallowed by inserting so called <i>fences</i> or |
| 442 | <i>barriers</i>. This requires explicit reasoning about access |
| 443 | reordering and understanding of hardware memory models. A coding style |
| 444 | along these lines is still used in the Linux kernel. It should not |
| 445 | be used in new Android applications, and is also not further discussed here. |
| 446 | </li> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 447 | </ol> |

<h2 id="practice">Practice</h2>

<p>Debugging memory consistency problems can be very difficult. If a missing
lock, <code>atomic</code>, or <code>volatile</code> declaration causes
some code to read stale data, you may not be able to
figure out why by examining memory dumps with a debugger. By the time you can
issue a debugger query, the CPU cores may have all observed the full set of
accesses, and the contents of memory and the CPU registers will appear to be in
an “impossible” state.</p>

<h3 id="c_dont">What not to do in C</h3>

<p>Here we present some examples of incorrect code, along with simple ways to
fix them. Before we do that, we need to discuss the use of a basic language
feature.</p>

<h4 id="volatile">C/C++ and "volatile"</h4>

<p>C and C++ <code>volatile</code> declarations are a very special-purpose tool.
They prevent <i>the compiler</i> from reordering or removing <i>volatile</i>
accesses. This can be helpful for code accessing hardware device registers,
memory mapped to more than one location, or in connection with
<code>setjmp</code>. But C and C++ <code>volatile</code>, unlike Java
<code>volatile</code>, is not designed for thread communication.</p>

<p>In C and C++, accesses to <code>volatile</code>
data may be reordered with accesses to non-volatile data, and there are no
atomicity guarantees. Thus <code>volatile</code> can't be used for sharing data between
threads in portable code, even on a uniprocessor. C <code>volatile</code> usually does not
prevent access reordering by the hardware, so by itself it is even less useful in
multi-threaded SMP environments. This is the reason C11 and C++11 support
<code>atomic</code> objects. You should use those instead.</p>

<p>A lot of older C and C++ code still abuses <code>volatile</code> for thread
communication. This often works correctly for data that fits
in a machine register, provided it is used with either explicit fences or in cases
in which memory ordering is not important. But it is not guaranteed to work
correctly with future compilers.</p>

Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 488 | |
| 489 | <h4 id="examplesc">Examples</h4> |
| 490 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 491 | <p>In most cases you’d be better off with a lock (like a |
| 492 | <code>pthread_mutex_t</code> or C++11 <code>std::mutex</code>) rather than an |
atomic operation, but we will employ atomic operations to illustrate how they
would be used in a practical situation.</p>
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 495 | |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 496 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 497 | <pre>MyThing* gGlobalThing = NULL; // Wrong! See below. |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 498 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 499 | void initGlobalThing() // runs in Thread 1 |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 500 | { |
    MyThing* thing = malloc(sizeof(*thing));
| 502 | memset(thing, 0, sizeof(*thing)); |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 503 | thing->x = 5; |
| 504 | thing->y = 10; |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 505 | /* initialization complete, publish */ |
| 506 | gGlobalThing = thing; |
| 507 | } |
| 508 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 509 | void useGlobalThing() // runs in Thread 2 |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 510 | { |
| 511 | if (gGlobalThing != NULL) { |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 512 | int i = gGlobalThing->x; // could be 5, 0, or uninitialized data |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 513 | ... |
| 514 | } |
| 515 | }</pre> |
| 516 | |
| 517 | <p>The idea here is that we allocate a structure, initialize its fields, and at |
| 518 | the very end we “publish” it by storing it in a global variable. At that point, |
| 519 | any other thread can see it, but that’s fine since it’s fully initialized, |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 520 | right?</p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 521 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 522 | <p>The problem is that the store to <code>gGlobalThing</code> could be observed |
| 523 | before the fields are initialized, typically because either the compiler or the |
| 524 | processor reordered the stores to <code>gGlobalThing</code> and |
| 525 | <code>thing->x</code>. Another thread reading from <code>thing->x</code> could |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 526 | see 5, 0, or even uninitialized data.</p> |
| 527 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 528 | <p>The core problem here is a data race on <code>gGlobalThing</code>. |
| 529 | If Thread 1 calls <code>initGlobalThing()</code> while Thread 2 |
| 530 | calls <code>useGlobalThing()</code>, <code>gGlobalThing</code> can be |
| 531 | read while being written. |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 532 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 533 | <p>This can be fixed by declaring <code>gGlobalThing</code> as |
| 534 | atomic. In C++11:</p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 535 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 536 | <pre>atomic<MyThing*> gGlobalThing(NULL);</pre> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 537 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 538 | <p>This ensures that the writes will become visible to other threads |
| 539 | in the proper order. It also guarantees to prevent some other failure |
| 540 | modes that are otherwise allowed, but unlikely to occur on real |
| 541 | Android hardware. For example, it ensures that we cannot see a |
| 542 | <code>gGlobalThing</code> pointer that has only been partially written.</p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 543 | |
| 544 | <h3 id="j_dont">What not to do in Java</h3> |
| 545 | |
| 546 | <p>We haven’t discussed some relevant Java language features, so we’ll take a |
| 547 | quick look at those first.</p> |
| 548 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 549 | <p>Java technically does not require code to be data-race-free. And there |
| 550 | is a small amount of very-carefully-written Java code that works correctly |
| 551 | in the presence of data races. However, writing such code is extremely |
| 552 | tricky, and we discuss it only briefly below. To make matters |
| 553 | worse, the experts who specified the meaning of such code no longer believe the |
| 554 | specification is correct. (The specification is fine for data-race-free |
| 555 | code.) |
| 556 | |
| 557 | <p>For now we will adhere to the data-race-free model, for which Java provides |
| 558 | essentially the same guarantees as C and C++. Again, the language provides |
| 559 | some primitives that explicitly relax sequential consistency, notably the |
| 560 | <code>lazySet()</code> and <code>weakCompareAndSet()</code> calls |
| 561 | in <code>java.util.concurrent.atomic</code>. |
| 562 | As with C and C++, we will ignore these for now. |
| 563 | |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 564 | <h4 id="sync_volatile">Java's "synchronized" and "volatile" keywords</h4> |
| 565 | |
| 566 | <p>The “synchronized” keyword provides the Java language’s in-built locking |
| 567 | mechanism. Every object has an associated “monitor” that can be used to provide |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 568 | mutually exclusive access. If two threads try to "synchronize" on the |
| 569 | same object, one of them will wait until the other completes.</p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 570 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 571 | <p>As we mentioned above, Java's <code>volatile T</code> is the analog of |
| 572 | C++11's <code>atomic<T></code>. Concurrent accesses to |
| 573 | <code>volatile</code> fields are allowed, and don't result in data races. |
| 574 | Ignoring <code>lazySet()</code> et al. and data races, it is the Java VM's job to |
| 575 | make sure that the result still appears sequentially consistent. |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 576 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 577 | <p>In particular, if thread 1 writes to a <code>volatile</code> field, and |
| 578 | thread 2 subsequently reads from that same field and sees the newly written |
| 579 | value, then thread 2 is also guaranteed to see all writes previously made by |
| 580 | thread 1. In terms of memory effect, writing to |
| 581 | a volatile is analogous to a monitor release, and |
| 582 | reading from a volatile is like a monitor acquire.</p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 583 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 584 | <p>There is one notable difference from C++'s <code>atomic</code>: |
| 585 | If we write <code>volatile int x;</code> |
| 586 | in Java, then <code>x++</code> is the same as <code>x = x + 1</code>; it |
| 587 | performs an atomic load, increments the result, and then performs an atomic |
| 588 | store. Unlike C++, the increment as a whole is not atomic. |
| 589 | Atomic increment operations are instead provided by |
the <code>java.util.concurrent.atomic</code> package.</p>
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 591 | |
| 592 | <h4 id="examplesj">Examples</h4> |
| 593 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 594 | <p>Here’s a simple, <em>incorrect</em> implementation of a monotonic counter: <span |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 595 | style="font-size:.9em;color:#777">(<em><a href="#more" style="color:#777">Java |
| 596 | theory and practice: Managing volatility</a></em>)</span>.</p> |
| 597 | |
| 598 | <pre>class Counter { |
| 599 | private int mValue; |
| 600 | |
| 601 | public int get() { |
| 602 | return mValue; |
| 603 | } |
| 604 | public void incr() { |
| 605 | mValue++; |
| 606 | } |
| 607 | }</pre> |
| 608 | |
| 609 | <p>Assume <code>get()</code> and <code>incr()</code> are called from multiple |
| 610 | threads, and we want to be sure that every thread sees the current count when |
| 611 | <code>get()</code> is called. The most glaring problem is that |
| 612 | <code>mValue++</code> is actually three operations:</p> |
| 613 | |
| 614 | <ol> |
| 615 | <li><code>reg = mValue</code></li> |
| 616 | <li><code>reg = reg + 1</code></li> |
| 617 | <li><code>mValue = reg</code></li> |
| 618 | </ol> |
| 619 | |
| 620 | <p>If two threads execute in <code>incr()</code> simultaneously, one of the |
| 621 | updates could be lost. To make the increment atomic, we need to declare |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 622 | <code>incr()</code> “synchronized”.</p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 623 | |
<p>It’s still broken, however, especially on SMP. There is still a data race,
in that <code>get()</code> can access <code>mValue</code> concurrently with
<code>incr()</code>. Under Java rules, the <code>get()</code> call can
appear to be reordered with respect to other code. For example, if we read two
counters in a row, the results might appear to be inconsistent
because the <code>get()</code> calls were reordered, either by the hardware or
| 630 | compiler. We can correct the problem by declaring <code>get()</code> to be |
| 631 | synchronized. With this change, the code is obviously correct.</p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 632 | |
| 633 | <p>Unfortunately, we’ve introduced the possibility of lock contention, which |
| 634 | could hamper performance. Instead of declaring <code>get()</code> to be |
synchronized, we could declare <code>mValue</code> with “volatile”. (Note that
<code>incr()</code> must still use <code>synchronized</code>, since
<code>mValue++</code> is otherwise not a single atomic operation.)
| 638 | This also avoids all data races, so sequential consistency is preserved. |
| 639 | <code>incr()</code> will be somewhat slower, since it incurs both monitor entry/exit |
| 640 | overhead, and the overhead associated with a volatile store, but |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 641 | <code>get()</code> will be faster, so even in the absence of contention this is |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 642 | a win if reads greatly outnumber writes. (See also {@link |
| 643 | java.util.concurrent.atomic.AtomicInteger} for a way to completely |
| 644 | remove the synchronized block.)</p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 645 | |
| 646 | <p>Here’s another example, similar in form to the earlier C examples:</p> |
| 647 | |
| 648 | <pre>class MyGoodies { |
| 649 | public int x, y; |
| 650 | } |
| 651 | class MyClass { |
| 652 | static MyGoodies sGoodies; |
| 653 | |
| 654 | void initGoodies() { // runs in thread 1 |
| 655 | MyGoodies goods = new MyGoodies(); |
| 656 | goods.x = 5; |
| 657 | goods.y = 10; |
| 658 | sGoodies = goods; |
| 659 | } |
| 660 | |
| 661 | void useGoodies() { // runs in thread 2 |
| 662 | if (sGoodies != null) { |
| 663 | int i = sGoodies.x; // could be 5 or 0 |
| 664 | .... |
| 665 | } |
| 666 | } |
| 667 | }</pre> |
| 668 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 669 | <p>This has the same problem as the C code, namely that there is |
| 670 | a data race on <code>sGoodies</code>. Thus the assignment |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 671 | <code>sGoodies = goods</code> might be observed before the initialization of the |
| 672 | fields in <code>goods</code>. If you declare <code>sGoodies</code> with the |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 673 | <code>volatile</code> keyword, sequential consistency is restored, and things will work |
| 674 | as expected. |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 675 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 676 | <p>Note that only the <code>sGoodies</code> reference itself is volatile. The |
| 677 | accesses to the fields inside it are not. Once <code>sGoodies</code> is |
| 678 | <code>volatile</code>, and memory ordering is properly preserved, the fields |
| 679 | cannot be concurrently accessed. The statement <code>z = |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 680 | sGoodies.x</code> will perform a volatile load of <code>MyClass.sGoodies</code> |
| 681 | followed by a non-volatile load of <code>sGoodies.x</code>. If you make a local |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 682 | reference <code>MyGoodies localGoods = sGoodies</code>, then a subsequent <code>z = |
| 683 | localGoods.x</code> will not perform any volatile loads.</p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 684 | |
| 685 | <p>A more common idiom in Java programming is the infamous “double-checked |
| 686 | locking”:</p> |
| 687 | |
| 688 | <pre>class MyClass { |
| 689 | private Helper helper = null; |
| 690 | |
| 691 | public Helper getHelper() { |
| 692 | if (helper == null) { |
| 693 | synchronized (this) { |
| 694 | if (helper == null) { |
| 695 | helper = new Helper(); |
| 696 | } |
| 697 | } |
| 698 | } |
| 699 | return helper; |
| 700 | } |
| 701 | }</pre> |
| 702 | |
| 703 | <p>The idea is that we want to have a single instance of a <code>Helper</code> |
| 704 | object associated with an instance of <code>MyClass</code>. We must only create |
| 705 | it once, so we create and return it through a dedicated <code>getHelper()</code> |
| 706 | function. To avoid a race in which two threads create the instance, we need to |
| 707 | synchronize the object creation. However, we don’t want to pay the overhead for |
| 708 | the “synchronized” block on every call, so we only do that part if |
| 709 | <code>helper</code> is currently null.</p> |
| 710 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 711 | <p>This has a data race on the <code>helper</code> field. It can be |
| 712 | set concurrently with the <code>helper == null</code> in another thread. |
| 713 | </p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 714 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 715 | <p>To see how this can fail, consider |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 716 | the same code rewritten slightly, as if it were compiled into a C-like language |
| 717 | (I’ve added a couple of integer fields to represent <code>Helper’s</code> |
| 718 | constructor activity):</p> |
| 719 | |
| 720 | <pre>if (helper == null) { |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 721 | synchronized() { |
| 722 | if (helper == null) { |
| 723 | newHelper = malloc(sizeof(Helper)); |
| 724 | newHelper->x = 5; |
| 725 | newHelper->y = 10; |
| 726 | helper = newHelper; |
| 727 | } |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 728 | } |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 729 | return helper; |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 730 | }</pre> |
| 731 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 732 | <p>There is nothing to prevent either the hardware or the compiler |
| 733 | from reordering the store to <code>helper</code> with those to the |
| 734 | <code>x</code>/<code>y</code> fields. Another thread could find |
<code>helper</code> non-null but with fields that are not yet set and ready to use.
For more details and additional failure modes, see the “‘Double Checked
Locking is Broken’ Declaration” link in the appendix, or Item
71 (“Use lazy initialization judiciously”) in Josh Bloch’s <em>Effective Java,
2nd Edition</em>.</p>
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 740 | |
| 741 | <p>There are two ways to fix this:</p> |
| 742 | <ol> |
| 743 | <li>Do the simple thing and delete the outer check. This ensures that we never |
| 744 | examine the value of <code>helper</code> outside a synchronized block.</li> |
<li>Declare <code>helper</code> volatile. With this one small change, the code
above will work correctly on Java 1.5 and later. (You may want to take
a minute to convince yourself that this is true.)</li>
| 748 | </ol> |
| 749 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 750 | <p>Here is another illustration of <code>volatile</code> behavior:</p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 751 | |
| 752 | <pre>class MyClass { |
| 753 | int data1, data2; |
| 754 | volatile int vol1, vol2; |
| 755 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 756 | void setValues() { // runs in Thread 1 |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 757 | data1 = 1; |
| 758 | vol1 = 2; |
| 759 | data2 = 3; |
| 760 | } |
| 761 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 762 | void useValues() { // runs in Thread 2 |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 763 | if (vol1 == 2) { |
| 764 | int l1 = data1; // okay |
| 765 | int l2 = data2; // wrong |
| 766 | } |
| 767 | } |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 768 | }</pre> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 769 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 770 | <p>Looking at <code>useValues()</code>, if Thread 2 hasn’t yet observed the |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 771 | update to <code>vol1</code>, then it can’t know if <code>data1</code> or |
| 772 | <code>data2</code> has been set yet. Once it sees the update to |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 773 | <code>vol1</code>, it knows that <code>data1</code> can be safely accessed |
| 774 | and correctly read without introducing a data race. However, |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 775 | it can’t make any assumptions about <code>data2</code>, because that store was |
| 776 | performed after the volatile store.</p> |
| 777 | |
<p>Note that <code>volatile</code> cannot be used to prevent reordering
of other memory accesses that race with each other. It is not guaranteed to
generate a machine memory fence instruction. It can be used to prevent
data races by executing code only after another thread has satisfied a
certain condition.</p>
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 783 | |
| 784 | <h3 id="bestpractice">What to do</h3> |
| 785 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 786 | <p>In C/C++, prefer C++11 |
| 787 | synchronization classes, such as <code>std::mutex</code>. If not, use |
| 788 | the corresponding <code>pthread</code> operations. |
| 789 | These include the proper memory fences, providing correct (sequentially consistent |
| 790 | unless otherwise specified) |
| 791 | and efficient behavior on all Android platform versions. Be sure to use them |
| 792 | correctly. For example, remember that condition variable waits may spuriously |
| 793 | return without being signaled, and should thus appear in a loop.</p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 794 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 795 | <p>It's best to avoid using atomic functions directly, unless the data structure |
| 796 | you are implementing is extremely simple, like a counter. Locking and |
| 797 | unlocking a pthread mutex require a single atomic operation each, |
| 798 | and often cost less than a single cache miss, if there’s no |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 799 | contention, so you’re not going to save much by replacing mutex calls with |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 800 | atomic ops. Lock-free designs for non-trivial data structures require |
| 801 | much more care to ensure that higher level operations on the data structure |
| 802 | appear atomic (as a whole, not just their explicitly atomic pieces).</p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 803 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 804 | <p>If you do use atomic operations, relaxing ordering with |
| 805 | <code>memory_order</code>... or <code>lazySet()</code> may provide performance |
| 806 | advantages, but requires deeper understanding than we have conveyed so far. |
| 807 | A large fraction of existing code using |
| 808 | these is discovered to have bugs after the fact. Avoid these if possible. |
If your use case doesn't exactly fit one of those in the next section,
make sure you either are an expert, or have consulted one.</p>
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 811 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 812 | <p>Avoid using <code>volatile</code> for thread communication in C/C++.</p> |
| 813 | |
| 814 | <p>In Java, concurrency problems are often best solved by |
| 815 | using an appropriate utility class from |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 816 | the {@link java.util.concurrent} package. The code is well written and well |
| 817 | tested on SMP.</p> |
| 818 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 819 | <p>Perhaps the safest thing you can do is make your objects immutable. Objects |
| 820 | from classes like Java's String and Integer hold data that cannot be changed once an |
| 821 | object is created, avoiding all potential for data races on those objects. |
| 822 | The book <em>Effective |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 823 | Java, 2nd Ed.</em> has specific instructions in “Item 15: Minimize Mutability”. Note in |
particular the importance of declaring Java fields “final” <span
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 825 | style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Bloch</a>)</span>.</p> |
| 826 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 827 | <p>Even if an object is immutable, remember that communicating it to another |
| 828 | thread without any kind of synchronization is a data race. This can occasionally |
| 829 | be acceptable in Java (see below), but requires great care, and is likely to result in |
| 830 | brittle code. If it's not extremely performance critical, add a |
| 831 | <code>volatile</code> declaration. In C++, communicating a pointer or |
| 832 | reference to an immutable object without proper synchronization, |
| 833 | like any data race, is a bug. |
| 834 | In this case, it is reasonably likely to result in intermittent crashes since, |
| 835 | for example, the receiving thread may see an uninitialized method table |
| 836 | pointer due to store reordering.</p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 837 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 838 | <p>If neither an existing library class, nor an immutable class is |
| 839 | appropriate, the Java <code>synchronized</code> statement or C++ |
| 840 | <code>lock_guard</code> / <code>unique_lock</code> should be used to guard |
| 841 | accesses to any field that can be accessed by more than one thread. If mutexes won’t |
| 842 | work for your situation, you should declare shared fields |
| 843 | <code>volatile</code> or <code>atomic</code>, but you must take great care to |
| 844 | understand the interactions between threads. These declarations won’t |
| 845 | save you from common concurrent programming mistakes, but they will help you |
| 846 | avoid the mysterious failures associated with optimizing compilers and SMP |
| 847 | mishaps.</p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 848 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 849 | <p>You should avoid |
| 850 | "publishing" a reference to an object, i.e. making it available to other |
| 851 | threads, in its constructor. This is less critical in C++ or if you stick to |
| 852 | our "no data races" advice in Java. But it's always good advice, and becomes |
| 853 | critical if your Java code is |
| 854 | run in other contexts in which the Java security model matters, and untrusted |
| 855 | code may introduce a data race by accessing that "leaked" object reference. |
| 856 | It's also critical if you choose to ignore our warnings and use some of the techniques |
| 857 | in the next section. |
| 858 | See <span style="font-size:.9em;color:#777">(<a href="#more" |
| 859 | style="color:#777">Safe Construction Techniques in Java</a>)</span> for |
details.</p>
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 861 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 862 | <h2 id="weak">A little more about weak memory orders</h2> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 863 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 864 | <p>C++11 and later provide explicit mechanisms for relaxing the sequential |
| 865 | consistency guarantees for data-race-free programs. Explicit |
| 866 | <code>memory_order_relaxed</code>, <code>memory_order_acquire</code> (loads |
only), and <code>memory_order_release</code> (stores only) arguments for atomic
| 868 | operations each provide strictly weaker guarantees than the default, typically |
| 869 | implicit, <code>memory_order_seq_cst</code>. <code>memory_order_acq_rel</code> |
| 870 | provides both <code>memory_order_acquire</code> and |
<code>memory_order_release</code> guarantees for atomic read-modify-write
| 872 | operations. <code>memory_order_consume</code> is not yet sufficiently |
| 873 | well specified or implemented to be useful, and should be ignored for now.</p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 874 | |
<p>The <code>lazySet</code> methods in <code>java.util.concurrent.atomic</code>
are similar to C++ <code>memory_order_release</code> stores. Java's
| 877 | ordinary variables are sometimes used as a replacement for |
| 878 | <code>memory_order_relaxed</code> accesses, though they are actually |
| 879 | even weaker. Unlike C++, there is no real mechanism for unordered |
| 880 | accesses to variables that are declared as <code>volatile</code>.</p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 881 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 882 | <p>You should generally avoid these unless there are pressing performance reasons to |
| 883 | use them. On weakly ordered machine architectures like ARM, using them will |
| 884 | commonly save on the order of a few dozen machine cycles for each atomic operation. |
| 885 | On x86, the performance win is limited to stores, and likely to be less |
| 886 | noticeable. |
| 887 | Somewhat counter-intuitively, the benefit may decrease with larger core counts, |
| 888 | as the memory system becomes more of a limiting factor.</p> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 889 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 890 | <p>The full semantics of weakly ordered atomics are complicated. |
| 891 | In general they require |
| 892 | precise understanding of the language rules, which we will |
| 893 | not go into here. For example: |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 894 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 895 | <ul> |
| 896 | <li> The compiler or hardware can move <code>memory_order_relaxed</code> |
| 897 | accesses into (but not out of) a critical section bounded by a lock |
| 898 | acquisition and release. This means that two |
| 899 | <code>memory_order_relaxed</code> stores may become visible out of order, |
| 900 | even if they are separated by a critical section. |
| 901 | <li> An ordinary Java variable, when abused as a shared counter, may appear |
| 902 | to another thread to decrease, even though it is only incremented by a single |
| 903 | other thread. But this is not true for C++ atomic |
| 904 | <code>memory_order_relaxed</code>. |
| 905 | </ul> |
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 906 | |
<p>With that as a warning,
here we give a small number of idioms that seem to cover many of the use
cases for weakly ordered atomics. Many of these are applicable only to C++:</p>
Dirk Dougherty | d589421 | 2012-11-28 18:53:10 -0800 | [diff] [blame] | 910 | |
Hans Boehm | 9e629ad | 2015-09-14 13:50:00 -0700 | [diff] [blame] | 911 | <h3 id="nonracing">Non-racing accesses</h3> |
| 912 | <p>It is fairly common that a variable is atomic because it is <em>sometimes</em> |
| 913 | read concurrently with a write, but not all accesses have this issue. |
| 914 | For example a variable |
| 915 | may need to be atomic because it is read outside a critical section, but all |
| 916 | updates are protected by a lock. In that case, a read that happens to be |
| 917 | protected by the same lock |
cannot race, since there cannot be concurrent writes. In such a case, the
non-racing access (the load in this case) can be annotated with
<code>memory_order_relaxed</code> without changing the correctness of C++ code.
| 921 | The lock implementation already enforces the required memory ordering |
| 922 | with respect to access by other threads, and <code>memory_order_relaxed</code> |
| 923 | specifies that essentially no additional ordering constraints need to be |
| 924 | enforced for the atomic access.</p> |
| 925 | |
| 926 | <p>There is no real analog to this in Java.</p> |
| 927 | |
<h3 id="hint_only">Result is not relied upon for correctness</h3>

<p>When we use a racing load only to generate a hint, it is generally also OK
not to enforce any memory ordering for the load. If the value is
not reliable, we also cannot reliably use the result to infer anything about
other variables. Thus it is OK
if memory ordering is not guaranteed, and the load is
supplied with a <code>memory_order_relaxed</code> argument.</p>

<p>A common
instance of this is the use of C++ <code>compare_exchange</code>
to atomically replace <code>x</code> by <code>f(x)</code>.
The initial load of <code>x</code> to compute <code>f(x)</code>
does not need to be reliable. If we get it wrong, the
<code>compare_exchange</code> will fail and we will retry.
It is fine for the initial load of <code>x</code> to use
a <code>memory_order_relaxed</code> argument; only the memory ordering
of the actual <code>compare_exchange</code> matters.</p>

<h3 id="unread">Atomically modified but unread data</h3>

<p>Occasionally data is modified in parallel by multiple threads, but
not examined until the parallel computation is complete. A good
example of this is a counter that is atomically incremented (e.g.
using <code>fetch_add()</code> in C++ or
<code>atomic_fetch_add_explicit()</code>
in C) by multiple threads in parallel, while the result of these calls
is always ignored. The resulting value is only read at the end,
after all updates are complete.</p>

<p>In this case, there is no way to tell whether accesses to this data
were reordered, and hence C++ code may use a <code>memory_order_relaxed</code>
argument.</p>

<p>Simple event counters are a common example of this. Since the case is
so common, it is worth making some observations about it:</p>

<ul>
<li> Use of <code>memory_order_relaxed</code> improves performance,
but may not address the most important performance issue: Every update
requires exclusive access to the cache line holding the counter. This
results in a cache miss every time a new thread accesses the counter.
If updates are frequent and alternate between threads, it is much faster
to avoid updating the shared counter every time by,
for example, using thread-local counters and summing them at the end.
<li> This technique can be combined with the previous section: It is possible to
concurrently read approximate and unreliable values while they are being updated,
with all operations using <code>memory_order_relaxed</code>.
But it is important to treat the resulting values as completely unreliable.
Just because the count appears to have been incremented once does not
mean another thread can be counted on to have reached the point
at which the increment was performed. The increment may instead have
been reordered with earlier code. (As for the similar case we mentioned
earlier, C++ does guarantee that a second load of such a counter will not
return a value less than an earlier load in the same thread. Unless, of
course, the counter overflowed.)
<li> It is common to find code that tries to compute approximate
counter values by performing individual atomic (or non-atomic) reads and writes,
without making the increment as a whole atomic. The usual argument is that
this is "close enough" for performance counters or the like.
It is typically not. When updates are sufficiently frequent (a case
you probably care about), a large fraction of the counts are typically
lost. On a quad-core device, more than half the counts may commonly be lost.
(Easy exercise: construct a two-thread scenario in which the counter is
updated a million times, but the final counter value is one.)
</ul>

<h3 id="flag">Simple flag communication</h3>

<p>A <code>memory_order_release</code> store (or read-modify-write operation)
ensures that if a <code>memory_order_acquire</code> load
(or read-modify-write operation) subsequently reads the written value, then it will
also observe any stores (ordinary or atomic) that preceded the
<code>memory_order_release</code> store. Conversely, any loads
preceding the <code>memory_order_release</code> store will not observe any
stores that followed the <code>memory_order_acquire</code> load.
Unlike <code>memory_order_relaxed</code>, this allows such atomic operations
to be used to communicate the progress of one thread to another.</p>

<p>For example, we can rewrite the double-checked locking example
from above in C++ as</p>

<pre>
class MyClass {
  private:
    atomic&lt;Helper*> helper {nullptr};
    mutex mtx;

  public:
    Helper* getHelper() {
      Helper* myHelper = helper.load(memory_order_acquire);
      if (myHelper == nullptr) {
        lock_guard&lt;mutex> lg(mtx);
        myHelper = helper.load(memory_order_relaxed);
        if (myHelper == nullptr) {
          myHelper = new Helper();
          helper.store(myHelper, memory_order_release);
        }
      }
      return myHelper;
    }
};
</pre>

<p>The acquire load and release store ensure that if we see a non-null
<code>helper</code>, then we will also see its fields correctly initialized.
We have also incorporated the prior observation that non-racing loads
can use <code>memory_order_relaxed</code>.</p>

<p>A Java programmer could conceivably represent <code>helper</code> as a
<code>java.util.concurrent.atomic.AtomicReference&lt;Helper></code>
and use <code>lazySet()</code> as the release store. The load
operations would continue to use plain <code>get()</code> calls.</p>

<p>In both cases, our performance tweaking concentrated on the initialization
path, which is unlikely to be performance critical.
A more readable compromise might be:</p>

<pre>
Helper* getHelper() {
  Helper* myHelper = helper.load(memory_order_acquire);
  if (myHelper != nullptr) {
    return myHelper;
  }
  lock_guard&lt;mutex> lg(mtx);
  if (helper == nullptr) {
    helper = new Helper();
  }
  return helper;
}
</pre>

<p>This provides the same fast path, but resorts to default,
sequentially-consistent, operations on the non-performance-critical slow
path.</p>

<p>Even here, <code>helper.load(memory_order_acquire)</code> is
likely to generate the same code on current Android-supported
architectures as a plain (sequentially-consistent) reference to
<code>helper</code>. Really the most beneficial optimization here
may be the introduction of <code>myHelper</code> to eliminate a
second load, though a future compiler might do that automatically.</p>

<p>Acquire/release ordering does not prevent stores from becoming visibly
delayed, and does not ensure that stores become visible to other threads
in a consistent order. As a result, it does not support a tricky,
but fairly common, coding pattern exemplified by Dekker's mutual exclusion
algorithm: All threads first set a flag indicating that they want to do
something; if a thread <i>t</i> then notices that no other thread is
trying to do something, it can safely proceed, knowing that there
will be no interference. No other thread will be
able to proceed, since <i>t</i>'s flag is still set. This fails
if the flag is accessed using acquire/release ordering, since that does not
prevent a thread's flag from becoming visible to others late, after they have
erroneously proceeded. The default <code>memory_order_seq_cst</code>
does prevent this.</p>

<h3 id="immutable">Immutable fields</h3>

<p>If an object field is initialized on first use and then never changed,
it may be possible to initialize and subsequently read it using weakly
ordered accesses. In C++, it could be declared as <code>atomic</code>
and accessed using <code>memory_order_relaxed</code>; in Java, it
could be declared without <code>volatile</code> and accessed without
special measures. This requires that all of the following hold:</p>

<ul>
<li>It should be possible to tell from the value of the field itself
whether it has already been initialized. To access the field,
the fast-path test-and-return should read the field only once.
In Java the latter is essential. Even if the field tests as initialized,
a second load may read the earlier uninitialized value. In C++
the "read once" rule is merely good practice.
<li>Both initialization and subsequent loads must be atomic,
in that partial updates should not be visible. For Java, the field
should not be a <code>long</code> or <code>double</code>. For C++,
an atomic assignment is required; constructing it in place will not work, since
construction of an <code>atomic</code> is not atomic.
<li>Repeated initializations must be safe, since multiple threads
may read the uninitialized value concurrently. In C++, this generally
follows from the "trivially copyable" requirement imposed on all
atomic types; types with nested owned pointers would require
deallocation in the
copy constructor, and would not be trivially copyable. For Java,
certain reference types are acceptable:
<li>Java references are limited to immutable types containing only final
fields. The constructor of the immutable type should not publish
a reference to the object. In this case the Java final field rules
ensure that if a reader sees the reference, it will also see the
initialized final fields. C++ has no analog to these rules, and
pointers to owned objects are unacceptable for this reason as well (in
addition to violating the "trivially copyable" requirement).
</ul>

<h2 id="closing_notes">Closing Notes</h2>

<p>While this document does more than merely scratch the surface, it doesn’t
manage more than a shallow gouge. This is a very broad and deep topic. Some
areas for further exploration:</p>

<ul>
<li>The actual Java and C++ memory models are expressed in terms of a
<em>happens-before</em> relation that specifies when two actions are guaranteed
to occur in a certain order. When we defined a data race, we informally
talked about two memory accesses happening "simultaneously".
Officially, this is defined as neither one happening before the other.
It is instructive to learn the actual definitions of <em>happens-before</em>
and <em>synchronizes-with</em> in the Java or C++ memory model.
Although the intuitive notion of "simultaneously" is generally good
enough, these definitions are instructive, particularly if you
are contemplating using weakly ordered atomic operations in C++.
(The current Java specification only defines <code>lazySet()</code>
very informally.)</li>
<li>Explore what compilers are and aren’t allowed to do when reordering code.
(The JSR-133 spec has some great examples of legal transformations that lead to
unexpected results.)</li>
<li>Find out how to write immutable classes in Java and C++. (There’s more to
it than just “don’t change anything after construction”.)</li>
<li>Internalize the recommendations in the Concurrency section of <em>Effective
Java, 2nd Edition</em>. (For example, you should avoid calling methods that are
meant to be overridden while inside a synchronized block.)</li>
<li>Read through the {@link java.util.concurrent} and {@link
java.util.concurrent.atomic} APIs to see what's available. Consider using
concurrency annotations like <code>@ThreadSafe</code> and
<code>@GuardedBy</code> (from net.jcip.annotations).</li>
</ul>

<p>The <a href="#more">Further Reading</a> section in the appendix has links to
documents and web sites that will better illuminate these topics.</p>

<h2 id="appendix">Appendix</h2>

<h3 id="sync_stores">Implementing synchronization stores</h3>

<p><em>(This isn’t something most programmers will find themselves implementing,
but the discussion is illuminating.)</em></p>

<p>For small built-in types like <code>int</code>, and hardware supported by
Android, ordinary load and store instructions ensure that a store
will be made visible either in its entirety, or not at all, to another
processor loading the same location. Thus some basic notion
of "atomicity" is provided for free.</p>

<p>As we saw before, this does not suffice. In order to ensure sequential
consistency we also need to prevent reordering of operations, and to ensure
that memory operations become visible to other processors in a consistent
order. It turns out that the latter is automatic on Android-supported
hardware, provided we make judicious choices for enforcing the former,
so we largely ignore it here.</p>

<p>Order of memory operations is preserved both by preventing reordering
by the compiler and by preventing reordering by the hardware. Here we focus
on the latter.</p>

<p>Memory ordering on current hardware is generally enforced with
"fence" instructions that
roughly prevent instructions following the fence from becoming visible
before instructions preceding the fence. (These are also commonly
called "barrier" instructions, but that risks confusion with
<code>pthread_barrier</code>-style barriers, which do much more
than this.) The precise meaning of
fence instructions is a fairly complicated topic that has to address
the way in which guarantees provided by multiple different kinds of fences
interact, and how these combine with other ordering guarantees usually
provided by the hardware. This is a high-level overview, so we will
gloss over these details.</p>

<p>The most basic kind of ordering guarantee is that provided by C++
<code>memory_order_acquire</code> and <code>memory_order_release</code>
atomic operations: Memory operations preceding a release store
should be visible following an acquire load. On ARM, this is
enforced by:</p>

<ul>
<li>Preceding the store instruction with a suitable fence instruction.
This prevents all prior memory accesses from being reordered with the
store instruction. (It also unnecessarily prevents reordering with
later store instructions. ARMv8 provides an alternative that doesn't share
this problem.)
<li>Following the load instruction with a suitable fence instruction,
preventing the load from being reordered with subsequent accesses.
(And once again providing unneeded ordering with at least earlier loads.)
</ul>

<p>Together these suffice for C++ acquire/release ordering.
They are necessary, but not sufficient, for Java <code>volatile</code>
or C++ sequentially consistent <code>atomic</code>.</p>

<p>To see what else we need, consider the fragment of Dekker’s algorithm
we briefly mentioned earlier.
<code>flag1</code> and <code>flag2</code> are C++ <code>atomic</code>
or Java <code>volatile</code> variables, both initially false.</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>flag1 = true<br />
if (flag2 == false)<br />
    <em>critical-stuff</em></code></td>
<td><code>flag2 = true<br />
if (flag1 == false)<br />
    <em>critical-stuff</em></code></td>
</tr>
</table>

<p>Sequential consistency implies that one of the assignments to
<code>flag</code><i>n</i> must be executed first, and be seen by the
test in the other thread. Thus, we will never see
these threads executing the “critical-stuff” simultaneously.</p>

<p>But the fencing required for acquire/release ordering only adds
fences at the beginning and end of each thread, which doesn't help
here. We additionally need to ensure that if a
<code>volatile</code>/<code>atomic</code> store is followed by
a <code>volatile</code>/<code>atomic</code> load, the two are not reordered.
This is normally enforced by adding a fence not just before a
sequentially consistent store, but also after it.
(This is again much stronger than required, since this fence typically orders
all earlier memory accesses with respect to all later ones. Again, ARMv8
offers a more targeted solution.)</p>

<p>We could instead associate the extra fence with sequentially
consistent loads. Since stores are less frequent, the convention
we described is more common and used on Android.</p>

<p>As we saw in an earlier section, we need to insert a store/load barrier
between the two operations. The code executed in the VM for a volatile access
will look something like this:</p>

<table>
<tr>
<th>volatile load</th>
<th>volatile store</th>
</tr>
<tr>
<td><code>reg = A<br />
<em>fence for "acquire" (1)</em></code></td>
<td><code><em>fence for "release" (2)</em><br />
A = reg<br />
<em>fence for later atomic load (3)</em></code></td>
</tr>
</table>

<p>Real machine architectures commonly provide multiple types of
fences, which order different types of accesses and may have
different costs. The choice between these is subtle, and influenced
by the need to ensure that stores are made visible to other cores in
a consistent order, and that the memory ordering imposed by the
combination of multiple fences composes correctly. For more details,
please see the University of Cambridge page with <a href="#more">
collected mappings of atomics to actual processors</a>.</p>

<p>On some architectures, notably x86, the "acquire" and "release"
barriers are unnecessary, since the hardware always implicitly
enforces sufficient ordering. Thus on x86 only the last fence (3)
is really generated. Similarly, on x86, atomic read-modify-write
operations implicitly include a strong fence, so they never
require any fences. On ARM, all the fences we discussed above are
required.</p>

<h3 id="more">Further reading</h3>

<p>Web pages and documents that provide greater depth or breadth. The more generally useful articles are nearer the top of the list.</p>

<dl>
<dt>Shared Memory Consistency Models: A Tutorial</dt>
<dd>Written in 1995 by Adve &amp; Gharachorloo, this is a good place to start if you want to dive more deeply into memory consistency models.
<br /><a href="http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf">http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf</a></dd>

<dt>Memory Barriers</dt>
<dd>Nice little article summarizing the issues.
<br /><a href="http://en.wikipedia.org/wiki/Memory_barrier">http://en.wikipedia.org/wiki/Memory_barrier</a></dd>

<dt>Threads Basics</dt>
<dd>An introduction to multi-threaded programming in C++ and Java, by Hans Boehm. Excellent discussion of data races and basic synchronization methods.
<br /><a href="http://www.hboehm.info/c++mm/threadsintro.html">http://www.hboehm.info/c++mm/threadsintro.html</a></dd>

<dt>Java Concurrency In Practice</dt>
<dd>Published in 2006, this book covers a wide range of topics in great detail. Highly recommended for anyone writing multi-threaded code in Java.
<br /><a href="http://www.javaconcurrencyinpractice.com">http://www.javaconcurrencyinpractice.com</a></dd>

<dt>JSR-133 (Java Memory Model) FAQ</dt>
<dd>A gentle introduction to the Java memory model, including an explanation of synchronization, volatile variables, and construction of final fields.
(A bit dated, particularly when it discusses other languages.)
<br /><a href="http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html">http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html</a></dd>

<dt>Validity of Program Transformations in the Java Memory Model</dt>
<dd>A rather technical explanation of remaining problems with the
Java memory model. These issues do not apply to data-race-free
programs.
<br /><a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.112.1790&amp;type=pdf">http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.112.1790&amp;type=pdf</a></dd>

<dt>Overview of package java.util.concurrent</dt>
<dd>The documentation for the <code>java.util.concurrent</code> package. Near the bottom of the page is a section entitled “Memory Consistency Properties” that explains the guarantees made by the various classes.
<br />{@link java.util.concurrent java.util.concurrent} Package Summary</dd>

<dt>Java Theory and Practice: Safe Construction Techniques in Java</dt>
<dd>This article examines in detail the perils of references escaping during object construction, and provides guidelines for thread-safe constructors.
<br /><a href="http://www.ibm.com/developerworks/java/library/j-jtp0618.html">http://www.ibm.com/developerworks/java/library/j-jtp0618.html</a></dd>

<dt>Java Theory and Practice: Managing Volatility</dt>
<dd>A nice article describing what you can and can’t accomplish with volatile fields in Java.
<br /><a href="http://www.ibm.com/developerworks/java/library/j-jtp06197.html">http://www.ibm.com/developerworks/java/library/j-jtp06197.html</a></dd>

<dt>The “Double-Checked Locking is Broken” Declaration</dt>
<dd>Bill Pugh’s detailed explanation of the various ways in which double-checked locking is broken without <code>volatile</code> or <code>atomic</code>.
Includes C/C++ and Java.
<br /><a href="http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html">http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html</a></dd>

<dt>[ARM] Barrier Litmus Tests and Cookbook</dt>
<dd>A discussion of ARM SMP issues, illuminated with short snippets of ARM code. If you found the examples in this document too unspecific, or want to read the formal description of the DMB instruction, read this. Also describes the instructions used for memory barriers on executable code (possibly useful if you’re generating code on the fly). Note that this predates ARMv8, which also
supports an additional set of memory ordering instructions.
<br /><a href="http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf">http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf</a></dd>

<dt>Linux Kernel Memory Barriers</dt>
<dd>Documentation for Linux kernel memory barriers. Includes some useful examples and ASCII art.
<br /><a href="http://www.kernel.org/doc/Documentation/memory-barriers.txt">http://www.kernel.org/doc/Documentation/memory-barriers.txt</a></dd>

<dt>ISO/IEC JTC1 SC22 WG21 (C++ standards) 14882 (C++ programming language), section 1.10 and clause 29 (“Atomic operations library”)</dt>
<dd>Draft standard for C++ atomic operation features. This version is
close to the C++14 standard, which includes minor changes in this area
from C++11.
<br /><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4527.pdf">http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4527.pdf</a>
<br />(intro: <a href="http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf">http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf</a>)</dd>

<dt>ISO/IEC JTC1 SC22 WG14 (C standards) 9899 (C programming language) chapter 7.16 (“Atomics &lt;stdatomic.h>”)</dt>
<dd>Draft standard for ISO/IEC 9899-201x C atomic operation features.
For details, also check later defect reports.
<br /><a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf">http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf</a></dd>

<dt>C/C++11 mappings to processors (University of Cambridge)</dt>
<dd>Jaroslav Sevcik and Peter Sewell's collection of translations
of C++ atomics to various common processor instruction sets.
<br /><a href="http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html">
http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html</a></dd>

<dt>Dekker’s algorithm</dt>
<dd>The “first known correct solution to the mutual exclusion problem in concurrent programming”. The Wikipedia article has the full algorithm, with a discussion of how it would need to be updated to work with modern optimizing compilers and SMP hardware.
<br /><a href="http://en.wikipedia.org/wiki/Dekker's_algorithm">http://en.wikipedia.org/wiki/Dekker's_algorithm</a></dd>

<dt>Comments on ARM vs. Alpha and address dependencies</dt>
<dd>An e-mail on the arm-kernel mailing list from Catalin Marinas. Includes a nice summary of address and control dependencies.
<br /><a href="http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-05/msg11811.html">http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-05/msg11811.html</a></dd>

<dt>What Every Programmer Should Know About Memory</dt>
<dd>A very long and detailed article about different types of memory, particularly CPU caches, by Ulrich Drepper.
<br /><a href="http://www.akkadia.org/drepper/cpumemory.pdf">http://www.akkadia.org/drepper/cpumemory.pdf</a></dd>

<dt>Reasoning about the ARM weakly consistent memory model</dt>
<dd>This paper was written by Chong &amp; Ishtiaq of ARM, Ltd. It attempts to describe the ARM SMP memory model in a rigorous but accessible fashion. The definition of “observability” used here comes from this paper.
<br /><a href="http://portal.acm.org/ft_gateway.cfm?id=1353528&amp;type=pdf">http://portal.acm.org/ft_gateway.cfm?id=1353528&amp;type=pdf</a></dd>

<dt>The JSR-133 Cookbook for Compiler Writers</dt>
<dd>Doug Lea wrote this as a companion to the JSR-133 (Java Memory Model) documentation. It contains the initial set of implementation guidelines
for the Java memory model that was used by many compiler writers, is
still widely cited, and is likely to provide insight.
Unfortunately, the four fence varieties discussed there are not a good
match for Android-supported architectures, and the above C++11 mappings
are now a better source of precise recipes, even for Java.
<br /><a href="http://g.oswego.edu/dl/jmm/cookbook.html">http://g.oswego.edu/dl/jmm/cookbook.html</a></dd>

<dt>x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors</dt>
<dd>A precise description of the x86 memory model. Precise descriptions of
the ARM memory model are unfortunately significantly more complicated.
<br /><a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf">http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf</a></dd>
</dl>