page.title=SMP Primer for Android
page.tags=ndk,native

page.article=true
@jd:body

<div id="tb-wrapper">
<div id="tb">
<h2>In this document</h2>
<ol class="nolist">
  <li><a href="#theory">Theory</a>
    <ol class="nolist">
    <li style="margin:3px 0 0"><a href="#mem_consistency">Memory consistency models</a>
    </li>
    <li style="margin:3px 0 0"><a href="#racefree">Data-race-free programming</a>
      <ol class="nolist">
      <li style="margin:0"><a href="#dataraces">What's a "data race"?</a></li>
      <li style="margin:0"><a href="#avoiding">Avoiding data races</a></li>
      <li style="margin:0"><a href="#reordering">When memory reordering becomes visible</a></li>
      </ol>
    </li>
    </ol>
  </li>
  <li><a href="#practice">Practice</a>
    <ol class="nolist">
    <li style="margin:3px 0 0"><a href="#c_dont">What not to do in C</a>
      <ol class="nolist">
      <li style="margin:0"><a href="#volatile">C/C++ and “volatile”</a></li>
      <li style="margin:0"><a href="#examplesc">Examples</a></li>
      </ol>
    </li>
    <li style="margin:3px 0 0"><a href="#j_dont">What not to do in Java</a>
      <ol class="nolist">
      <li style="margin:0"><a href="#sync_volatile">“synchronized” and “volatile”</a></li>
      <li style="margin:0"><a href="#examplesj">Examples</a></li>
      </ol>
    </li>
    <li style="margin:3px 0 0"><a href="#bestpractice">What to do</a>
    </li>
    </ol>
  </li>
  <li><a href="#weak">A little more about weak memory orders</a>
    <ol class="nolist">
    <li style="margin:0"><a href="#nonracing">Non-racing accesses</a></li>
    <li style="margin:0"><a href="#hint_only">Result is not relied upon for correctness</a></li>
    <li style="margin:0"><a href="#unread">Atomically modified but unread data</a></li>
    <li style="margin:0"><a href="#flag">Simple flag communication</a></li>
    <li style="margin:0"><a href="#immutable">Immutable fields</a></li>
    </ol>
  </li>
  <li><a href="#closing_notes">Closing Notes</a></li>
  <li><a href="#appendix">Appendix</a>
    <ol class="nolist">
    <li style="margin:0"><a href="#sync_stores">Implementing synchronization stores</a></li>
    <li style="margin:0"><a href="#more">Further reading</a></li>
    </ol>
  </li>
</ol>
</div>
</div>

<p>Android 3.0 and later platform versions are optimized to support
multiprocessor architectures. This document introduces issues that
can arise when writing multithreaded code for symmetric multiprocessor systems in C, C++, and the Java
programming language (hereafter referred to simply as “Java” for the sake of
brevity). It's intended as a primer for Android app developers, not as a complete
discussion on the subject.</p>

<h2 id="intro">Introduction</h2>

<p>SMP is an acronym for “Symmetric Multi-Processor”. It describes a design in
which two or more identical CPU cores share access to main memory. Until
a few years ago, all Android devices were UP (Uni-Processor).</p>

<p>Most &mdash; if not all &mdash; Android devices always had multiple CPUs, but
in the past only one of them was used to run applications while the others managed
various bits of device hardware (for example, the radio). The CPUs may have had
different architectures, and the programs running on them couldn’t use main memory
to communicate with each other.</p>

<p>Most Android devices sold today are built around SMP designs,
making things a bit more complicated for software developers. Race conditions
in a multi-threaded program may not cause visible problems on a uniprocessor,
but they may cause regular failures when two or more of your threads
are running simultaneously on different cores.
What’s more, code may be more or less prone to failures when run on different
processor architectures, or even on different implementations of the same
architecture. Code that has been thoroughly tested on x86 may break badly on ARM.
Code may start to fail when recompiled with a more modern compiler.</p>

<p>The rest of this document will explain why, and tell you what you need to do
to ensure that your code behaves correctly.</p>


<h2 id="theory">Memory consistency models: Why SMPs are a bit different</h2>

<p>This is a high-speed, glossy overview of a complex subject. Some areas will
be incomplete, but none of it should be misleading or wrong. As you
will see in the next section, the details here are usually not important.</p>

<p>See <a href="#more">Further reading</a> at the end of the document for
pointers to more thorough treatments of the subject.</p>

<p>Memory consistency models, or often just “memory models”, describe the
guarantees the programming language or hardware architecture
makes about memory accesses. For example,
if you write a value to address A, and then write a value to address B, the
model might guarantee that every CPU core sees those writes happen in that
order.</p>

<p>The model most programmers are accustomed to is <em>sequential
consistency</em>, which is described like this <span
style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Adve &amp;
Gharachorloo</a>)</span>:</p>

<ul>
<li>All memory operations appear to execute one at a time</li>
<li>All operations in a single thread appear to execute in the order described
by that processor's program.</li>
</ul>

<p>Let's assume temporarily that we have a very simple compiler or interpreter
that introduces no surprises: It translates
assignments in the source code to load and store instructions in exactly the
corresponding order, one instruction per access. We'll also assume for
simplicity that each thread executes on its own processor.</p>

<p>If you look at a bit of code and see that it does some reads and writes from
memory, on a sequentially-consistent CPU architecture you know that the code
will do those reads and writes in the expected order. It’s possible that the
CPU is actually reordering instructions and delaying reads and writes, but there
is no way for code running on the device to tell that the CPU is doing anything
other than execute instructions in a straightforward manner. (We’ll ignore
memory-mapped device driver I/O.)</p>

<p>To illustrate these points it’s useful to consider small snippets of code,
commonly referred to as <em>litmus tests</em>.</p>

<p>Here’s a simple example, with code running on two threads:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>A = 3<br />
B = 5</code></td>
<td><code>reg0 = B<br />
reg1 = A</code></td>
</tr>
</table>

<p>In this and all future litmus examples, memory locations are represented by
capital letters (A, B, C) and CPU registers start with “reg”. All memory is
initially zero. Instructions are executed from top to bottom. Here, thread 1
stores the value 3 at location A, and then the value 5 at location B. Thread 2
loads the value from location B into reg0, and then loads the value from
location A into reg1. (Note that we’re writing in one order and reading in
another.)</p>

<p>Thread 1 and thread 2 are assumed to execute on different CPU cores. You
should <strong>always</strong> make this assumption when thinking about
multi-threaded code.</p>

<p>Sequential consistency guarantees that, after both threads have finished
executing, the registers will be in one of the following states:</p>


<table>
<tr>
<th>Registers</th>
<th>States</th>
</tr>
<tr>
<td>reg0=5, reg1=3</td>
<td>possible (thread 1 ran first)</td>
</tr>
<tr>
<td>reg0=0, reg1=0</td>
<td>possible (thread 2 ran first)</td>
</tr>
<tr>
<td>reg0=0, reg1=3</td>
<td>possible (concurrent execution)</td>
</tr>
<tr>
<td>reg0=5, reg1=0</td>
<td>never</td>
</tr>
</table>

<p>To get into a situation where we see B=5 before we see the store to A, either
the reads or the writes would have to happen out of order. On a
sequentially-consistent machine, that can’t happen.</p>

<p>Uni-processors, including x86 and ARM, are normally sequentially consistent.
Threads appear to execute in interleaved fashion, as the OS kernel switches
between them. Most SMP systems, including x86 and ARM,
are not sequentially consistent. For example, it is common for
hardware to buffer stores on their way to memory, so that they
don't immediately reach memory and become visible to other cores.</p>

<p>The details vary substantially. For example, x86, though not sequentially
consistent, still guarantees that reg0 = 5 and reg1 = 0 remains impossible.
Stores are buffered, but their order is maintained.
ARM, on the other hand, does not. The order of buffered stores is not
maintained, and stores may not reach all other cores at the same time.
These differences are important to assembly programmers.
However, as we will see below, C, C++, or Java programmers can
and should program in a way that hides such architectural differences.</p>

<p>So far, we've unrealistically assumed that it is only the hardware that
reorders instructions. In reality, the compiler also reorders instructions to
improve performance. In our example, the compiler might decide that some later
code in Thread 2 needed the value of reg1 before it needed reg0, and thus load
reg1 first. Or some prior code may already have loaded A, and the compiler
might decide to reuse that value instead of loading A again. In either case,
the loads to reg0 and reg1 might be reordered.</p>

<p>Reordering accesses to different memory locations,
either in the hardware, or in the compiler, is
allowed, since it doesn't affect the execution of a single thread, and
it can significantly improve performance. As we will see, with a bit of care,
we can also prevent it from affecting the results of multithreaded programs.</p>

<p>Since compilers can also reorder memory accesses, this problem is actually
not new to SMPs. Even on a uniprocessor, a compiler could reorder the loads to
reg0 and reg1 in our example, and Thread 1 could be scheduled between the
reordered instructions. But if our compiler happened to not reorder, we might
never observe this problem. On most ARM SMPs, even without compiler
reordering, the reordering will probably be seen, possibly after a very large
number of successful executions. Unless you're programming in assembly
language, SMPs generally just make it more likely you'll see problems that were
there all along.</p>

<h2 id="racefree">Data-race-free programming</h2>

<p>Fortunately, there is usually an easy way to avoid thinking about any of
these details. If you follow some straightforward rules, it's usually safe
to forget all of the preceding section except the "sequential consistency" part.
Unfortunately, the other complications may become visible if you
accidentally violate those rules.</p>

<p>Modern programming languages encourage what's known as a "data-race-free"
programming style. So long as you promise not to introduce "data races",
and avoid a handful of constructs that tell the compiler otherwise, the compiler
and hardware promise to provide sequentially consistent results. This doesn't
really mean they avoid memory access reordering. It does mean that if you
follow the rules you won't be able to tell that memory accesses are being
reordered. It's a lot like telling you that sausage is a delicious and
appetizing food, so long as you promise not to visit the
sausage factory. Data races are what expose the ugly truth about memory
reordering.</p>

<h3 id="dataraces">What's a "data race"?</h3>

<p>A <i>data race</i> occurs when at least two threads simultaneously access
the same ordinary data, and at least one of them modifies it. By "ordinary
data" we mean something that's not specifically a synchronization object
intended for thread communication. Mutexes, condition variables, Java
volatiles, or C++ atomic objects are not ordinary data, and their accesses
are allowed to race. In fact they are used to prevent data races on other
objects.</p>

<p>In order to determine whether two threads simultaneously access the same
memory location, we can ignore the memory-reordering discussion from above, and
assume sequential consistency. The following program doesn't have a data race
if <code>A</code> and <code>B</code> are ordinary boolean variables that are
initially false:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>if (A) B = true</code></td>
<td><code>if (B) A = true</code></td>
</tr>
</table>

<p>Since operations are not reordered, both conditions will evaluate to false, and
neither variable is ever updated. Thus there cannot be a data race. There is
no need to think about what might happen if the load from <code>A</code>
and store to <code>B</code> in
Thread 1 were somehow reordered. The compiler is not allowed to reorder Thread
1 by rewriting it as "<code>B = true; if (!A) B = false</code>". That would be
like making sausage in the middle of town in broad daylight.</p>

<p>Data races are officially defined on basic built-in types like integers and
references or pointers. Assigning to an <code>int</code> while simultaneously
reading it in another thread is clearly a data race. But both the C++
standard library and
the Java Collections libraries are written to allow you to also reason about
data races at the library level. They promise to not introduce data races
unless there are concurrent accesses to the same container, at least one of
which updates it. Updating a <code>set&lt;T&gt;</code> in one thread while
simultaneously reading it in another allows the library to introduce a
data race, and can thus be thought of informally as a "library-level data race".
Conversely, updating one <code>set&lt;T&gt;</code> in one thread, while reading
a different one in another, does not result in a data race, because the
library promises not to introduce a (low-level) data race in that case.</p>

<p>Normally concurrent accesses to different fields in a data structure
cannot introduce a data race. However there is one important exception to
this rule: Contiguous sequences of bit-fields in C or C++ are treated as
a single "memory location". Accessing any bit-field in such a sequence
is treated as accessing all of them for purposes of determining the
existence of a data race. This reflects the inability of common hardware
to update individual bits without also reading and re-writing adjacent bits.
Java programmers have no analogous concerns.</p>

<h3 id="avoiding">Avoiding data races</h3>

Modern programming languages provide a number of synchronization
mechanisms to avoid data races. The most basic tools are:

<dl>
<dt>Locks or Mutexes</dt>

<dd>Mutexes (C++11 <code>std::mutex</code>, or <code>pthread_mutex_t</code>), or
<code>synchronized</code> blocks in Java can be used to ensure that certain
sections of code do not run concurrently with other sections of code accessing
the same data. We'll refer to these and other similar facilities generically
as "locks." Consistently acquiring a specific lock before accessing a shared
data structure, and releasing it afterwards, prevents data races when accessing
the data structure. It also ensures that updates and accesses are atomic, i.e. no
other update to the data structure can run in the middle. This is deservedly
by far the most common tool for preventing data races. (A minimal C++ sketch
follows this list.) The use of Java
<code>synchronized</code> blocks or C++ <code>lock_guard</code>
or <code>unique_lock</code> ensures that locks are properly released in the
event of an exception.
</dd>

<dt>Volatile/atomic variables</dt>

<dd>Java provides <code>volatile</code> fields that support concurrent access
without introducing data races. Since 2011, C and C++ support
<code>atomic</code> variables and fields with similar semantics. These are
typically more difficult to use than locks, since they only ensure that
individual accesses to a single variable are atomic. (In C++ this normally
extends to simple read-modify-write operations, like increments. Java
requires special method calls for that.)
Unlike locks, <code>volatile</code> or <code>atomic</code> variables can't
be used directly to prevent other threads from interfering with longer code sequences.</dd>

</dl>
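
<p>As a minimal sketch of the lock-based approach (the names here are
illustrative, not taken from any particular API), every access to the shared
data happens while the same mutex is held, so there is no data race:</p>

<pre>#include &lt;mutex&gt;

static std::mutex gCountMutex;
static int gCount = 0;    // only accessed while gCountMutex is held

void incrementCount() {
    std::lock_guard&lt;std::mutex&gt; lock(gCountMutex);  // released automatically, even on exceptions
    ++gCount;
}

int readCount() {
    std::lock_guard&lt;std::mutex&gt; lock(gCountMutex);
    return gCount;
}</pre>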

<p>It's important to note that <code>volatile</code> has very different
meanings in C++ and Java. In C++, <code>volatile</code> does not prevent data
races, though older code often uses it as a workaround for the lack of
<code>atomic</code> objects. This is no longer recommended; in
C++, use <code>atomic&lt;T&gt;</code> for variables that can be concurrently
accessed by multiple threads. C++ <code>volatile</code> is meant for
device registers and the like.</p>

<p>C/C++ <code>atomic</code> variables or Java <code>volatile</code> variables
can be used to prevent data races on other variables. If <code>flag</code> is
declared to have type <code>atomic&lt;bool&gt;</code>
or <code>atomic_bool</code> (C/C++) or <code>volatile boolean</code> (Java),
and is initially false, then the following snippet is data-race-free:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>A = ...<br />
&nbsp;&nbsp;flag = true</code>
</td>
<td><code>while (!flag) {}<br />
... = A</code>
</td>
</tr>
</table>

<p>Since Thread 2 waits for <code>flag</code> to be set, the access to
<code>A</code> in Thread 2 must happen after, and not concurrently with, the
assignment to <code>A</code> in Thread 1. Thus there is no data race on
<code>A</code>. The race on <code>flag</code> doesn't count as a data race,
since volatile/atomic accesses are not "ordinary memory accesses".</p>
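
<p>Expressed as C++11 code rather than as a litmus table, the same pattern
looks roughly like this (a sketch; the names are illustrative):</p>

<pre>#include &lt;atomic&gt;

int A = 0;                          // ordinary data
std::atomic&lt;bool&gt; flag(false);      // synchronization object

void thread1() {
    A = 42;
    flag.store(true);               // sequentially consistent by default
}

void thread2() {
    while (!flag.load()) {}         // spin until the flag is set
    int value = A;                  // guaranteed to observe 42; no data race
}</pre>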

<p>The implementation is required to prevent or hide memory reordering
sufficiently to make code like the preceding litmus test behave as expected.
This normally makes volatile/atomic memory accesses
substantially more expensive than ordinary accesses.</p>

<p>Although the preceding example is data-race-free, locks together with
<code>Object.wait()</code> in Java or condition variables in C/C++ usually
provide a better solution that does not involve waiting in a loop while
draining battery power.</p>

<h3 id="reordering">When memory reordering becomes visible</h3>

Data-race-free programming normally saves us from having to explicitly deal
with memory access reordering issues. However, there are several cases in
which reordering does become visible:

<ol>
<li> If your program has a bug resulting in an unintentional data race,
compiler and hardware transformations can become visible, and the behavior
of your program may be surprising. For example, if we forgot to declare
<code>flag</code> volatile in the preceding example, Thread 2 may see an
uninitialized <code>A</code>. Or the compiler may decide that <code>flag</code>
can't possibly change during Thread 2's loop and transform the program to

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>A = ...<br />
&nbsp;&nbsp;flag = true</code>
</td>
<td><code>reg0 = flag<br />
while (!reg0) {}<br />
... = A</code>
</td>
</tr>
</table>

When you debug, you may well see the loop continuing forever in spite of
the fact that <code>flag</code> is true.</li>

<li> C++ provides facilities for explicitly relaxing
sequential consistency even if there are no races. Atomic operations
can take explicit <code>memory_order_</code>... arguments. Similarly, the
<code>java.util.concurrent.atomic</code> package provides a more restricted
set of similar facilities, notably <code>lazySet()</code>. And Java
programmers occasionally use intentional data races for similar effect.
All of these provide performance improvements at a large
cost in programming complexity. We discuss them only briefly
<a href="#weak">below</a>.</li>

<li> Some C and C++ code is written in an older style, not entirely
consistent with current language standards, in which <code>volatile</code>
variables are used instead of <code>atomic</code> ones, and memory reordering
is explicitly prevented by inserting so-called <i>fences</i> or
<i>barriers</i>. This requires explicit reasoning about access
reordering and understanding of hardware memory models. A coding style
along these lines is still used in the Linux kernel. It should not
be used in new Android applications, and is also not further discussed here.
</li>
</ol>

<h2 id="practice">Practice</h2>

<p>Debugging memory consistency problems can be very difficult. If a missing
lock, <code>atomic</code> or <code>volatile</code> declaration causes
some code to read stale data, you may not be able to
figure out why by examining memory dumps with a debugger. By the time you can
issue a debugger query, the CPU cores may have all observed the full set of
accesses, and the contents of memory and the CPU registers will appear to be in
an “impossible” state.</p>

<h3 id="c_dont">What not to do in C</h3>

<p>Here we present some examples of incorrect code, along with simple ways to
fix them. Before we do that, we need to discuss the use of a basic language
feature.</p>

<h4 id="volatile">C/C++ and "volatile"</h4>

<p>C and C++ <code>volatile</code> declarations are a very special purpose tool.
They prevent <i>the compiler</i> from reordering or removing <i>volatile</i>
accesses. This can be helpful for code accessing hardware device registers,
memory mapped to more than one location, or in connection with
<code>setjmp</code>. But C and C++ <code>volatile</code>, unlike Java
<code>volatile</code>, is not designed for thread communication.</p>

<p>In C and C++, accesses to <code>volatile</code>
data may be reordered with accesses to non-volatile data, and there are no
atomicity guarantees. Thus <code>volatile</code> can't be used for sharing data between
threads in portable code, even on a uniprocessor. C <code>volatile</code> usually does not
prevent access reordering by the hardware, so by itself it is even less useful in
multi-threaded SMP environments. This is the reason C11 and C++11 support
<code>atomic</code> objects. You should use those instead.</p>

<p>A lot of older C and C++ code still abuses <code>volatile</code> for thread
communication. This often works correctly for data that fits
in a machine register, provided it is used with either explicit fences or in cases
in which memory ordering is not important. But it is not guaranteed to work
correctly with future compilers.</p>


<h4 id="examplesc">Examples</h4>

<p>In most cases you’d be better off with a lock (like a
<code>pthread_mutex_t</code> or C++11 <code>std::mutex</code>) rather than an
atomic operation, but we will employ the latter to illustrate how they would be
used in a practical situation.</p>


<pre>MyThing* gGlobalThing = NULL;  // Wrong!  See below.

void initGlobalThing()    // runs in Thread 1
{
    MyThing* thing = malloc(sizeof(*thing));
    memset(thing, 0, sizeof(*thing));
    thing-&gt;x = 5;
    thing-&gt;y = 10;
    /* initialization complete, publish */
    gGlobalThing = thing;
}

void useGlobalThing()    // runs in Thread 2
{
    if (gGlobalThing != NULL) {
        int i = gGlobalThing-&gt;x;    // could be 5, 0, or uninitialized data
        ...
    }
}</pre>

<p>The idea here is that we allocate a structure, initialize its fields, and at
the very end we “publish” it by storing it in a global variable. At that point,
any other thread can see it, but that’s fine since it’s fully initialized,
right?</p>

<p>The problem is that the store to <code>gGlobalThing</code> could be observed
before the fields are initialized, typically because either the compiler or the
processor reordered the stores to <code>gGlobalThing</code> and
<code>thing-&gt;x</code>. Another thread reading from <code>thing-&gt;x</code> could
see 5, 0, or even uninitialized data.</p>

<p>The core problem here is a data race on <code>gGlobalThing</code>.
If Thread 1 calls <code>initGlobalThing()</code> while Thread 2
calls <code>useGlobalThing()</code>, <code>gGlobalThing</code> can be
read while being written.</p>

<p>This can be fixed by declaring <code>gGlobalThing</code> as
atomic. In C++11:</p>

<pre>atomic&lt;MyThing*&gt; gGlobalThing(NULL);</pre>

<p>This ensures that the writes will become visible to other threads
in the proper order. It also guarantees to prevent some other failure
modes that are otherwise allowed, but unlikely to occur on real
Android hardware. For example, it ensures that we cannot see a
<code>gGlobalThing</code> pointer that has only been partially written.</p>
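
<p>Put together, a corrected C++11 version of the publishing code might look
like the following sketch (it reuses the <code>MyThing</code> name from above;
the details are illustrative):</p>

<pre>#include &lt;atomic&gt;

struct MyThing {
    int x;
    int y;
};

std::atomic&lt;MyThing*&gt; gGlobalThing(nullptr);

void initGlobalThing()    // runs in Thread 1
{
    MyThing* thing = new MyThing();
    thing-&gt;x = 5;
    thing-&gt;y = 10;
    /* initialization complete, publish */
    gGlobalThing.store(thing);    // sequentially consistent by default
}

void useGlobalThing()    // runs in Thread 2
{
    MyThing* thing = gGlobalThing.load();
    if (thing != nullptr) {
        int i = thing-&gt;x;    // now guaranteed to be 5
    }
}</pre>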

<h3 id="j_dont">What not to do in Java</h3>

<p>We haven’t discussed some relevant Java language features, so we’ll take a
quick look at those first.</p>

<p>Java technically does not require code to be data-race-free. And there
is a small amount of very-carefully-written Java code that works correctly
in the presence of data races. However, writing such code is extremely
tricky, and we discuss it only briefly below. To make matters
worse, the experts who specified the meaning of such code no longer believe the
specification is correct. (The specification is fine for data-race-free
code.)</p>

<p>For now we will adhere to the data-race-free model, for which Java provides
essentially the same guarantees as C and C++. Again, the language provides
some primitives that explicitly relax sequential consistency, notably the
<code>lazySet()</code> and <code>weakCompareAndSet()</code> calls
in <code>java.util.concurrent.atomic</code>.
As with C and C++, we will ignore these for now.</p>

<h4 id="sync_volatile">Java's "synchronized" and "volatile" keywords</h4>

<p>The “synchronized” keyword provides the Java language’s built-in locking
mechanism. Every object has an associated “monitor” that can be used to provide
mutually exclusive access. If two threads try to "synchronize" on the
same object, one of them will wait until the other completes.</p>

<p>As we mentioned above, Java's <code>volatile T</code> is the analog of
C++11's <code>atomic&lt;T&gt;</code>. Concurrent accesses to
<code>volatile</code> fields are allowed, and don't result in data races.
Ignoring <code>lazySet()</code> et al. and data races, it is the Java VM's job to
make sure that the result still appears sequentially consistent.</p>

<p>In particular, if thread 1 writes to a <code>volatile</code> field, and
thread 2 subsequently reads from that same field and sees the newly written
value, then thread 2 is also guaranteed to see all writes previously made by
thread 1. In terms of memory effect, writing to
a volatile is analogous to a monitor release, and
reading from a volatile is like a monitor acquire.</p>

<p>There is one notable difference from C++'s <code>atomic</code>:
If we write <code>volatile int x;</code>
in Java, then <code>x++</code> is the same as <code>x = x + 1</code>; it
performs an atomic load, increments the result, and then performs an atomic
store. Unlike C++, the increment as a whole is not atomic.
Atomic increment operations are instead provided by
the <code>java.util.concurrent.atomic</code> package.</p>

<h4 id="examplesj">Examples</h4>

<p>Here’s a simple, <em>incorrect</em> implementation of a monotonic counter: <span
style="font-size:.9em;color:#777">(<em><a href="#more" style="color:#777">Java
theory and practice: Managing volatility</a></em>)</span>.</p>

<pre>class Counter {
    private int mValue;

    public int get() {
        return mValue;
    }
    public void incr() {
        mValue++;
    }
}</pre>

<p>Assume <code>get()</code> and <code>incr()</code> are called from multiple
threads, and we want to be sure that every thread sees the current count when
<code>get()</code> is called. The most glaring problem is that
<code>mValue++</code> is actually three operations:</p>

<ol>
<li><code>reg = mValue</code></li>
<li><code>reg = reg + 1</code></li>
<li><code>mValue = reg</code></li>
</ol>

<p>If two threads execute in <code>incr()</code> simultaneously, one of the
updates could be lost. To make the increment atomic, we need to declare
<code>incr()</code> “synchronized”.</p>

<p>It’s still broken, however, especially on SMP. There is still a data race,
in that <code>get()</code> can access <code>mValue</code> concurrently with
<code>incr()</code>. Under Java rules, the <code>get()</code> call can
appear to be reordered with respect to other code. For example, if we read two
counters in a row, the results might appear to be inconsistent
because the <code>get()</code> calls were reordered, either by the hardware or
compiler. We can correct the problem by declaring <code>get()</code> to be
synchronized. With this change, the code is obviously correct.</p>

<p>Unfortunately, we’ve introduced the possibility of lock contention, which
could hamper performance. Instead of declaring <code>get()</code> to be
synchronized, we could declare <code>mValue</code> with “volatile”. (Note that
<code>incr()</code> must still use <code>synchronized</code> since
<code>mValue++</code> is otherwise not a single atomic operation.)
This also avoids all data races, so sequential consistency is preserved.
<code>incr()</code> will be somewhat slower, since it incurs both monitor entry/exit
overhead, and the overhead associated with a volatile store, but
<code>get()</code> will be faster, so even in the absence of contention this is
a win if reads greatly outnumber writes. (See also {@link
java.util.concurrent.atomic.AtomicInteger} for a way to completely
remove the synchronized block.)</p>

<p>Here’s another example, similar in form to the earlier C examples:</p>

<pre>class MyGoodies {
    public int x, y;
}
class MyClass {
    static MyGoodies sGoodies;

    void initGoodies() {    // runs in thread 1
        MyGoodies goods = new MyGoodies();
        goods.x = 5;
        goods.y = 10;
        sGoodies = goods;
    }

    void useGoodies() {    // runs in thread 2
        if (sGoodies != null) {
            int i = sGoodies.x;    // could be 5 or 0
            ....
        }
    }
}</pre>

<p>This has the same problem as the C code, namely that there is
a data race on <code>sGoodies</code>. Thus the assignment
<code>sGoodies = goods</code> might be observed before the initialization of the
fields in <code>goods</code>. If you declare <code>sGoodies</code> with the
<code>volatile</code> keyword, sequential consistency is restored, and things will work
as expected.</p>

<p>Note that only the <code>sGoodies</code> reference itself is volatile. The
accesses to the fields inside it are not. Once <code>sGoodies</code> is
<code>volatile</code>, and memory ordering is properly preserved, the fields
cannot be concurrently accessed. The statement <code>z =
sGoodies.x</code> will perform a volatile load of <code>MyClass.sGoodies</code>
followed by a non-volatile load of <code>sGoodies.x</code>. If you make a local
reference <code>MyGoodies localGoods = sGoodies</code>, then a subsequent <code>z =
localGoods.x</code> will not perform any volatile loads.</p>

<p>A more common idiom in Java programming is the infamous “double-checked
locking”:</p>

<pre>class MyClass {
    private Helper helper = null;

    public Helper getHelper() {
        if (helper == null) {
            synchronized (this) {
                if (helper == null) {
                    helper = new Helper();
                }
            }
        }
        return helper;
    }
}</pre>

<p>The idea is that we want to have a single instance of a <code>Helper</code>
object associated with an instance of <code>MyClass</code>. We must only create
it once, so we create and return it through a dedicated <code>getHelper()</code>
function. To avoid a race in which two threads create the instance, we need to
synchronize the object creation. However, we don’t want to pay the overhead for
the “synchronized” block on every call, so we only do that part if
<code>helper</code> is currently null.</p>
710
Hans Boehm9e629ad2015-09-14 13:50:00 -0700711<p>This has a data race on the <code>helper</code> field. It can be
712set concurrently with the <code>helper == null</code> in another thread.
713</p>
Dirk Doughertyd5894212012-11-28 18:53:10 -0800714
Hans Boehm9e629ad2015-09-14 13:50:00 -0700715<p>To see how this can fail, consider
Dirk Doughertyd5894212012-11-28 18:53:10 -0800716the same code rewritten slightly, as if it were compiled into a C-like language
717(I’ve added a couple of integer fields to represent <code>Helper’s</code>
718constructor activity):</p>
719
720<pre>if (helper == null) {
Hans Boehm9e629ad2015-09-14 13:50:00 -0700721 synchronized() {
722 if (helper == null) {
723 newHelper = malloc(sizeof(Helper));
724 newHelper-&gt;x = 5;
725 newHelper-&gt;y = 10;
726 helper = newHelper;
727 }
Dirk Doughertyd5894212012-11-28 18:53:10 -0800728 }
Hans Boehm9e629ad2015-09-14 13:50:00 -0700729 return helper;
Dirk Doughertyd5894212012-11-28 18:53:10 -0800730}</pre>
731
Hans Boehm9e629ad2015-09-14 13:50:00 -0700732<p>There is nothing to prevent either the hardware or the compiler
733from reordering the store to <code>helper</code> with those to the
734<code>x</code>/<code>y</code> fields. Another thread could find
735<code>helper</code> non-null but its fields not yet set and ready to use.
736For more details and more failure modes, see the “‘Double Checked
737Locking is Broken’ Declaration” link in the appendix for more details, or Item
73871 (“Use lazy initialization judiciously”) in Josh Bloch’s <em>Effective Java,
7392nd Edition.</em>.</p>
Dirk Doughertyd5894212012-11-28 18:53:10 -0800740
741<p>There are two ways to fix this:</p>
742<ol>
743<li>Do the simple thing and delete the outer check. This ensures that we never
744examine the value of <code>helper</code> outside a synchronized block.</li>
745<li>Declare <code>helper</code> volatile. With this one small change, the code
746in Example J-3 will work correctly on Java 1.5 and later. (You may want to take
747a minute to convince yourself that this is true.)</li>
748</ol>

<p>Here is another illustration of <code>volatile</code> behavior:</p>

<pre>class MyClass {
    int data1, data2;
    volatile int vol1, vol2;

    void setValues() {    // runs in Thread 1
        data1 = 1;
        vol1 = 2;
        data2 = 3;
    }

    void useValues() {    // runs in Thread 2
        if (vol1 == 2) {
            int l1 = data1;    // okay
            int l2 = data2;    // wrong
        }
    }
}</pre>

<p>Looking at <code>useValues()</code>, if Thread 2 hasn’t yet observed the
update to <code>vol1</code>, then it can’t know if <code>data1</code> or
<code>data2</code> has been set yet. Once it sees the update to
<code>vol1</code>, it knows that <code>data1</code> can be safely accessed
and correctly read without introducing a data race. However,
it can’t make any assumptions about <code>data2</code>, because that store was
performed after the volatile store.</p>

<p>Note that <code>volatile</code> cannot be used to prevent reordering
of other memory accesses that race with each other. It is not guaranteed to
generate a machine memory fence instruction. It can be used to prevent
data races by executing code only when another thread has satisfied a
certain condition.</p>

<h3 id="bestpractice">What to do</h3>

<p>In C/C++, prefer the C++11
synchronization classes, such as <code>std::mutex</code>. If they are not available, use
the corresponding <code>pthread</code> operations.
These include the proper memory fences, providing correct (sequentially consistent
unless otherwise specified)
and efficient behavior on all Android platform versions. Be sure to use them
correctly. For example, remember that condition variable waits may spuriously
return without being signaled, and should thus appear in a loop.</p>
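
<p>For instance, a C++11 condition-variable wait is normally written with the
wait inside a predicate loop, along these lines (a sketch; the names are
illustrative):</p>

<pre>#include &lt;condition_variable&gt;
#include &lt;mutex&gt;

std::mutex gMutex;
std::condition_variable gCond;
bool gReady = false;                // protected by gMutex

void producer() {
    {
        std::lock_guard&lt;std::mutex&gt; lock(gMutex);
        gReady = true;
    }
    gCond.notify_one();
}

void consumer() {
    std::unique_lock&lt;std::mutex&gt; lock(gMutex);
    while (!gReady) {               // the loop guards against spurious wakeups
        gCond.wait(lock);
    }
    // gReady is true and gMutex is held here
}</pre>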

<p>It's best to avoid using atomic functions directly, unless the data structure
you are implementing is extremely simple, like a counter. Locking and
unlocking a pthread mutex require a single atomic operation each,
and often cost less than a single cache miss, if there’s no
contention, so you’re not going to save much by replacing mutex calls with
atomic ops. Lock-free designs for non-trivial data structures require
much more care to ensure that higher level operations on the data structure
appear atomic (as a whole, not just their explicitly atomic pieces).</p>

<p>If you do use atomic operations, relaxing ordering with
<code>memory_order</code>... or <code>lazySet()</code> may provide performance
advantages, but requires deeper understanding than we have conveyed so far.
A large fraction of existing code using
these is discovered to have bugs after the fact. Avoid these if possible.
If your use case doesn't exactly fit one of those in the next section,
make sure you either are an expert, or have consulted one.</p>

<p>Avoid using <code>volatile</code> for thread communication in C/C++.</p>

<p>In Java, concurrency problems are often best solved by
using an appropriate utility class from
the {@link java.util.concurrent} package. The code is well written and well
tested on SMP.</p>

<p>Perhaps the safest thing you can do is make your objects immutable. Objects
from classes like Java's String and Integer hold data that cannot be changed once an
object is created, avoiding all potential for data races on those objects.
The book <em>Effective
Java, 2nd Ed.</em> has specific instructions in “Item 15: Minimize Mutability”. Note in
particular the importance of declaring Java fields “final” <span
style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Bloch</a>)</span>.</p>

<p>Even if an object is immutable, remember that communicating it to another
thread without any kind of synchronization is a data race. This can occasionally
be acceptable in Java (see below), but requires great care, and is likely to result in
brittle code. If it's not extremely performance critical, add a
<code>volatile</code> declaration. In C++, communicating a pointer or
reference to an immutable object without proper synchronization,
like any data race, is a bug.
In this case, it is reasonably likely to result in intermittent crashes since,
for example, the receiving thread may see an uninitialized method table
pointer due to store reordering.</p>

<p>If neither an existing library class, nor an immutable class is
appropriate, the Java <code>synchronized</code> statement or C++
<code>lock_guard</code> / <code>unique_lock</code> should be used to guard
accesses to any field that can be accessed by more than one thread. If mutexes won’t
work for your situation, you should declare shared fields
<code>volatile</code> or <code>atomic</code>, but you must take great care to
understand the interactions between threads. These declarations won’t
save you from common concurrent programming mistakes, but they will help you
avoid the mysterious failures associated with optimizing compilers and SMP
mishaps.</p>

<p>You should avoid
"publishing" a reference to an object, i.e. making it available to other
threads, in its constructor. This is less critical in C++ or if you stick to
our "no data races" advice in Java. But it's always good advice, and becomes
critical if your Java code is
run in other contexts in which the Java security model matters, and untrusted
code may introduce a data race by accessing that "leaked" object reference.
It's also critical if you choose to ignore our warnings and use some of the techniques
in the next section.
See <span style="font-size:.9em;color:#777">(<a href="#more"
style="color:#777">Safe Construction Techniques in Java</a>)</span> for
details.</p>

<h2 id="weak">A little more about weak memory orders</h2>

<p>C++11 and later provide explicit mechanisms for relaxing the sequential
consistency guarantees for data-race-free programs. Explicit
<code>memory_order_relaxed</code>, <code>memory_order_acquire</code> (loads
only), and <code>memory_order_release</code> (stores only) arguments for atomic
operations each provide strictly weaker guarantees than the default, typically
implicit, <code>memory_order_seq_cst</code>. <code>memory_order_acq_rel</code>
provides both <code>memory_order_acquire</code> and
<code>memory_order_release</code> guarantees for atomic read-modify-write
operations. <code>memory_order_consume</code> is not yet sufficiently
well specified or implemented to be useful, and should be ignored for now.</p>

<p>The <code>lazySet</code> methods in <code>java.util.concurrent.atomic</code>
are similar to C++ <code>memory_order_release</code> stores. Java's
ordinary variables are sometimes used as a replacement for
<code>memory_order_relaxed</code> accesses, though they are actually
even weaker. Unlike C++, there is no real mechanism for unordered
accesses to variables that are declared as <code>volatile</code>.</p>

<p>You should generally avoid these unless there are pressing performance reasons to
use them. On weakly ordered machine architectures like ARM, using them will
commonly save on the order of a few dozen machine cycles for each atomic operation.
On x86, the performance win is limited to stores, and likely to be less
noticeable.
Somewhat counter-intuitively, the benefit may decrease with larger core counts,
as the memory system becomes more of a limiting factor.</p>

<p>The full semantics of weakly ordered atomics are complicated.
In general they require
precise understanding of the language rules, which we will
not go into here. For example:</p>

<ul>
<li> The compiler or hardware can move <code>memory_order_relaxed</code>
accesses into (but not out of) a critical section bounded by a lock
acquisition and release. This means that two
<code>memory_order_relaxed</code> stores may become visible out of order,
even if they are separated by a critical section.</li>
<li> An ordinary Java variable, when abused as a shared counter, may appear
to another thread to decrease, even though it is only incremented by a single
other thread. But this is not true for C++ atomic
<code>memory_order_relaxed</code>.</li>
</ul>

<p>With that as a warning,
here we give a small number of idioms that seem to cover many of the use
cases for weakly ordered atomics. Many of these are applicable only to C++:</p>

<h3 id="nonracing">Non-racing accesses</h3>
<p>It is fairly common that a variable is atomic because it is <em>sometimes</em>
read concurrently with a write, but not all accesses have this issue.
For example a variable
may need to be atomic because it is read outside a critical section, but all
updates are protected by a lock. In that case, a read that happens to be
protected by the same lock
cannot race, since there cannot be concurrent writes. In such a case, the
non-racing access (the load, in this case) can be annotated with
<code>memory_order_relaxed</code> without changing the correctness of C++ code.
The lock implementation already enforces the required memory ordering
with respect to access by other threads, and <code>memory_order_relaxed</code>
specifies that essentially no additional ordering constraints need to be
enforced for the atomic access.</p>
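
<p>A C++ sketch of this situation (the names are illustrative; the store keeps
the default ordering because it still races with the unlocked reader):</p>

<pre>#include &lt;atomic&gt;
#include &lt;mutex&gt;

std::mutex gMutex;
std::atomic&lt;int&gt; gValue(0);    // atomic because it is also read without the lock

void setValue(int v) {
    std::lock_guard&lt;std::mutex&gt; lock(gMutex);   // all updates hold gMutex
    gValue.store(v);                            // default: memory_order_seq_cst
}

int getValueLocked() {
    std::lock_guard&lt;std::mutex&gt; lock(gMutex);
    // No update can run concurrently with this load, so it cannot race;
    // memory_order_relaxed does not change the correctness of this code.
    return gValue.load(std::memory_order_relaxed);
}

int getValueUnlocked() {
    return gValue.load();    // racing read; keep the default ordering
}</pre>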

<p>There is no real analog to this in Java.</p>

<h3 id="hint_only">Result is not relied upon for correctness</h3>

<p>When we use a racing load only to generate a hint, it's generally also OK
to not enforce any memory ordering for the load. If the value is
not reliable, we also can't reliably use the result to infer anything about
other variables. Thus it's OK
if memory ordering is not guaranteed, and the load is
supplied with a <code>memory_order_relaxed</code> argument.</p>

<p>A common
instance of this is the use of C++ <code>compare_exchange</code>
to atomically replace <code>x</code> by <code>f(x)</code>.
The initial load of <code>x</code> to compute <code>f(x)</code>
does not need to be reliable. If we get it wrong, the
<code>compare_exchange</code> will fail and we will retry.
It is fine for the initial load of <code>x</code> to use
a <code>memory_order_relaxed</code> argument; only memory ordering
for the actual <code>compare_exchange</code> matters.</p>
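
<p>A sketch of that idiom in C++ (<code>f</code> here stands for an arbitrary
pure function; the rest of the names are illustrative):</p>

<pre>#include &lt;atomic&gt;

int f(int v) { return 2 * v + 1; }    // any pure function of the old value

void updateAtomically(std::atomic&lt;int&gt;&amp; x) {
    // The initial load only produces a guess; if it is stale, the
    // compare_exchange fails and refreshes old_value with the current value.
    int old_value = x.load(std::memory_order_relaxed);
    while (!x.compare_exchange_weak(old_value, f(old_value))) {
        // old_value now holds the value actually observed; recompute f and retry.
    }
}</pre>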

<h3 id="unread">Atomically modified but unread data</h3>

<p>Occasionally data is modified in parallel by multiple threads, but
not examined until the parallel computation is complete. A good
example of this is a counter that is atomically incremented (e.g.
using <code>fetch_add()</code> in C++ or
<code>atomic_fetch_add_explicit()</code>
in C) by multiple threads in parallel, but the result of these calls
is always ignored. The resulting value is only read at the end,
after all updates are complete.</p>

<p>In this case, there is no way to tell whether accesses to this data
were reordered, and hence C++ code may use a <code>memory_order_relaxed</code>
argument.</p>
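
<p>For instance, such a counter might be bumped like this in C++ (a sketch;
the names are illustrative):</p>

<pre>#include &lt;atomic&gt;

std::atomic&lt;long&gt; gEventCount(0);

void recordEvent() {
    // The return value is ignored and the total is only examined after all
    // threads have finished, so relaxed ordering is sufficient here.
    gEventCount.fetch_add(1, std::memory_order_relaxed);
}</pre>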

<p>Simple event counters are a common example of this. Since it is
so common, it is worth making some observations about this case:</p>

<ul>
<li> Use of <code>memory_order_relaxed</code> improves performance,
but may not address the most important performance issue: Every update
requires exclusive access to the cache line holding the counter. This
results in a cache miss every time a new thread accesses the counter.
If updates are frequent and alternate between threads, it is much faster
to avoid updating the shared counter every time by,
for example, using thread-local counters and summing them at the end.</li>
<li> This technique can be combined with the previous section: It is possible to
concurrently read approximate and unreliable values while they are being updated,
with all operations using <code>memory_order_relaxed</code>.
But it is important to treat the resulting values as completely unreliable.
Just because the count appears to have been incremented once does not
mean another thread can be counted on to have reached the point
at which the increment has been performed. The increment may instead have
been reordered with earlier code. (As for the similar case we mentioned
earlier, C++ does guarantee that a second load of such a counter will not
return a value less than an earlier load in the same thread. Unless of
course the counter overflowed.)</li>
<li> It is common to find code that tries to compute approximate
counter values by performing individual atomic (or not) reads and writes, but
not making the increment as a whole atomic. The usual argument is that
this is "close enough" for performance counters or the like.
It's typically not.
When updates are sufficiently frequent (a case
you probably care about), a large fraction of the counts are typically
lost. On a quad core device, more than half the counts may commonly be lost.
(Easy exercise: construct a two thread scenario in which the counter is
updated a million times, but the final counter value is one.)</li>
</ul>

<h3 id="flag">Simple flag communication</h3>

<p>A <code>memory_order_release</code> store (or read-modify-write operation)
ensures that if subsequently a <code>memory_order_acquire</code> load
(or read-modify-write operation) reads the written value, then it will
also observe any stores (ordinary or atomic) that preceded the
<code>memory_order_release</code> store. Conversely, any loads
preceding the <code>memory_order_release</code> will not observe any
stores that followed the <code>memory_order_acquire</code> load.
Unlike <code>memory_order_relaxed</code>, this allows such atomic operations
to be used to communicate the progress of one thread to another.</p>
1007
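<p>A minimal sketch of such flag-based communication follows; the names
and the busy-wait loop are illustrative assumptions, not a recommended
production idiom:</p>

<pre>
#include &lt;atomic&gt;

int payload;                          // ordinary, non-atomic data
std::atomic&lt;bool&gt; ready{false};

void producer() {
  payload = 42;                       // plain store
  // Release: anyone who sees ready == true also sees the payload store.
  ready.store(true, std::memory_order_release);
}

void consumer() {
  // Acquire pairs with the release store above.
  while (!ready.load(std::memory_order_acquire)) {
    // spin; a real program would block or yield instead
  }
  int v = payload;                    // no data race; v is guaranteed to be 42
  (void) v;
}
</pre>

<p>If the consumer's acquire load observes <code>true</code>, the
release/acquire pairing guarantees that it also observes the earlier
store to <code>payload</code>.</p>
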
1008<p>For example, we can rewrite the double-checked locking example
1009from above in C++ as</p>
1010
1011<pre>
class MyClass {
  private:
    atomic&lt;Helper*&gt; helper {nullptr};
    mutex mtx;

  public:
    Helper* getHelper() {
      // Acquire: if this load sees a non-null pointer, it also sees the
      // initialization performed before the matching release store.
      Helper* myHelper = helper.load(memory_order_acquire);
      if (myHelper == nullptr) {
        lock_guard&lt;mutex&gt; lg(mtx);
        // This re-load cannot race: the only store is made with mtx held,
        // so relaxed ordering suffices.
        myHelper = helper.load(memory_order_relaxed);
        if (myHelper == nullptr) {
          myHelper = new Helper();
          // Release: publish the fully constructed Helper.
          helper.store(myHelper, memory_order_release);
        }
      }
      return myHelper;
    }
};
1031</pre>
1032
1033<p>The acquire load and release store ensure that if we see a non-null
1034<code>helper</code>, then we will also see its fields correctly initialized.
1035We've also incorporated the prior observation that non-racing loads
1036can use <code>memory_order_relaxed</code>.</p>
1037
1038<p>A Java programmer could conceivably represent <code>helper</code> as a
1039<code>java.util.concurrent.atomic.AtomicReference&lt;Helper&gt;</code>
1040and use <code>lazySet()</code> as the release store. The load
1041operations would continue to use plain <code>get()</code> calls.</p>
1042
1043<p>In both cases, our performance tweaking concentrated on the initialization
1044path, which is unlikely to be performance critical.
1045A more readable compromise might be:</p>
1046
1047<pre>
  Helper* getHelper() {
    Helper* myHelper = helper.load(memory_order_acquire);
    if (myHelper != nullptr) {
      return myHelper;
    }
    lock_guard&lt;mutex&gt; lg(mtx);
    // Default (sequentially consistent) operations on the slow path.
    if (helper == nullptr) {
      helper = new Helper();
    }
    return helper;
  }
1059</pre>
1060
1061<p>This provides the same fast path, but resorts to default,
1062sequentially-consistent, operations on the non-performance-critical slow
1063path.</p>
1064
1065<p>Even here, <code>helper.load(memory_order_acquire)</code> is
1066likely to generate the same code on current Android-supported
1067architectures as a plain (sequentially-consistent) reference to
1068<code>helper</code>. Really the most beneficial optimization here
1069may be the introduction of <code>myHelper</code> to eliminate a
1070second load, though a future compiler might do that automatically.</p>
1071
1072<p>Acquire/release ordering does not prevent stores from getting visibly
1073delayed, and does not ensure that stores become visible to other threads
1074in a consistent order. As a result, it does not support a tricky,
1075but fairly common coding pattern exemplified by Dekker's mutual exclusion
1076algorithm: All threads first set a flag indicating that they want to do
1077something; if a thread <i>t</i> then notices that no other thread is
1078trying to do something, it can safely proceed, knowing that there
1079will be no interference. No other thread will be
1080able to proceed, since <i>t</i>'s flag is still set. This fails
1081if the flag is accessed using acquire/release ordering, since that doesn't
1082prevent making a thread's flag visible to others late, after they have
1083erroneously proceeded. Default <code>memory_order_seq_cst</code>
1084does prevent it.</p>
1085
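<p>For illustration, the flag fragment of that pattern can be written with
default (sequentially consistent) C++ atomics as follows; the names are
illustrative, and the full algorithm must also handle the case in which
neither thread proceeds:</p>

<pre>
#include &lt;atomic&gt;

std::atomic&lt;bool&gt; flag1{false}, flag2{false};

void thread1() {
  flag1.store(true);        // default memory_order_seq_cst
  if (!flag2.load()) {      // cannot appear to precede the store above
    // critical-stuff: at most one of the two threads can get here
  }
}

void thread2() {
  flag2.store(true);
  if (!flag1.load()) {
    // critical-stuff
  }
}
</pre>

<p>With <code>memory_order_release</code> stores and
<code>memory_order_acquire</code> loads instead of the defaults, both
threads could read <code>false</code> and both could proceed.</p>
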
1086<h3 id="immutable">Immutable fields</h3>
1087
<p> If an object field is initialized on first use and then never changed,
it may be possible to initialize and subsequently read it using weakly
ordered accesses. In C++, it could be declared as <code>atomic</code>
and accessed using <code>memory_order_relaxed</code>; in Java, it
could be declared without <code>volatile</code> and accessed without
special measures. This requires that all of the following hold
(a sketch follows the list):</p>
1094
1095<ul>
1096<li>It should be possible to tell from the value of the field itself
whether it has already been initialized. The fast path that tests the
field and returns its value should read the field only once.
In Java the latter is essential. Even if the field tests as initialized,
1100a second load may read the earlier uninitialized value. In C++
1101the "read once" rule is merely good practice.
1102<li>Both initialization and subsequent loads must be atomic,
1103in that partial updates should not be visible. For Java, the field
1104should not be a <code>long</code> or <code>double</code>. For C++,
1105an atomic assignment is required; constructing it in place will not work, since
1106construction of an <code>atomic</code> is not atomic.
1107<li>Repeated initializations must be safe, since multiple threads
1108may read the uninitialized value concurrently. In C++, this generally
1109follows from the "trivially copyable" requirement imposed for all
atomic types; a type that owns pointers to other objects would need a
nontrivial copy constructor and destructor, and so would not be
trivially copyable. For Java, certain reference types are acceptable,
as described in the next item.
<li>Java references are limited to immutable types containing only final
1115fields. The constructor of the immutable type should not publish
1116a reference to the object. In this case the Java final field rules
1117ensure that if a reader sees the reference, it will also see the
1118initialized final fields. C++ has no analog to these rules and
1119pointers to owned objects are unacceptable for this reason as well (in
1120addition to violating the "trivially copyable" requirements).
1121</ul>
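
<p>As a C++ sketch of these rules, consider lazily caching an idempotently
computed value, here a hypothetical hash code, with zero reserved to mean
"not yet computed"; the class and the hash function are illustrative
assumptions:</p>

<pre>
#include &lt;atomic&gt;
#include &lt;cstdint&gt;
#include &lt;string&gt;

class Name {
 public:
  explicit Name(std::string s) : text_(std::move(s)) {}

  std::uint32_t hash() const {
    // Read the field only once; the value itself says whether it is initialized.
    std::uint32_t h = cachedHash_.load(std::memory_order_relaxed);
    if (h == 0) {
      h = computeHash(text_);
      if (h == 0) h = 1;   // reserve 0 to mean "not yet computed"
      // Racing initializations all store the same value, so losing the
      // race is harmless and relaxed ordering suffices.
      cachedHash_.store(h, std::memory_order_relaxed);
    }
    return h;
  }

 private:
  static std::uint32_t computeHash(const std::string&amp; s) {
    std::uint32_t h = 2166136261u;   // FNV-1a, as an arbitrary example
    for (unsigned char c : s) { h ^= c; h *= 16777619u; }
    return h;
  }

  std::string text_;                 // set in the constructor, never changed
  mutable std::atomic&lt;std::uint32_t&gt; cachedHash_{0};
};
</pre>

<p>The value itself records whether initialization has happened, the
accesses are atomic, and a racing re-initialization stores an identical
value, so all of the conditions above are satisfied.</p>
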
1123<h2 id="closing_notes">Closing Notes</h2>
1124
1125<p>While this document does more than merely scratch the surface, it doesn’t
1126manage more than a shallow gouge. This is a very broad and deep topic. Some
1127areas for further exploration:</p>
1128
1129<ul>
<li>The actual Java and C++ memory models are expressed in terms of a
<em>happens-before</em> relation that specifies when two actions are guaranteed
to occur in a certain order. When we defined a data race, we informally
talked about two memory accesses happening "simultaneously";
officially this is defined as neither one happening before the other.
Although that intuitive notion is generally good enough, it is
instructive to learn the actual definitions of <em>happens-before</em>
and <em>synchronizes-with</em> in the Java or C++ memory model,
particularly if you are contemplating using weakly ordered atomic
operations in C++.
(The current Java specification only defines <code>lazySet()</code>
very informally.)</li>
Dirk Doughertyd5894212012-11-28 18:53:10 -08001142<li>Explore what compilers are and aren’t allowed to do when reordering code.
1143(The JSR-133 spec has some great examples of legal transformations that lead to
1144unexpected results.)</li>
1145<li>Find out how to write immutable classes in Java and C++. (There’s more to
1146it than just “don’t change anything after construction”.)</li>
1147<li>Internalize the recommendations in the Concurrency section of <em>Effective
1148Java, 2nd Edition.</em> (For example, you should avoid calling methods that are
1149meant to be overridden while inside a synchronized block.)</li>
Dirk Doughertyd5894212012-11-28 18:53:10 -08001150<li>Read through the {@link java.util.concurrent} and {@link
1151java.util.concurrent.atomic} APIs to see what's available. Consider using
1152concurrency annotations like <code>@ThreadSafe</code> and
1153<code>@GuardedBy</code> (from net.jcip.annotations).</li>
1154</ul>
1155
1156<p>The <a href="#more">Further Reading</a> section in the appendix has links to
1157documents and web sites that will better illuminate these topics.</p>
1158
1159<h2 id="appendix">Appendix</h2>
1160
Dirk Doughertyd5894212012-11-28 18:53:10 -08001161<h3 id="sync_stores">Implementing synchronization stores</h3>
1162
1163<p><em>(This isn’t something most programmers will find themselves implementing,
1164but the discussion is illuminating.)</em></p>
1165
<p> For small built-in types like <code>int</code>, on hardware supported by
Android, ordinary load and store instructions ensure that a store
1168will be made visible either in its entirety, or not at all, to another
1169processor loading the same location. Thus some basic notion
1170of "atomicity" is provided for free.</p>
<p> As we saw before, this does not suffice. In order to ensure sequential
consistency we also need to prevent reordering of operations, and to ensure
that memory operations become visible to other processors in a consistent
order. It turns out that the latter is automatic on Android-supported
1176hardware, provided we make judicious choices for enforcing the former,
1177so we largely ignore it here.</p>
1178
<p> The order of memory operations is preserved both by preventing reordering
by the compiler and by preventing reordering by the hardware. Here we focus
on the latter.</p>
1182
1183<p> Memory ordering on current hardware is generally enforced with
1184"fence" instructions that
1185roughly prevent instructions following the fence from becoming visible
1186before instructions preceding the fence. (These are also commonly
1187called "barrier" instructions, but that risks confusion with
1188<code>pthread_barrier</code>-style barriers, which do much more
1189than this.) The precise meaning of
1190fence instructions is a fairly complicated topic that has to address
1191the way in which guarantees provided by multiple different kinds of fences
1192interact, and how these combine with other ordering guarantees usually
1193provided by the hardware. This is a high level overview, so we will
1194gloss over these details.</p>
1195
1196<p> The most basic kind of ordering guarantee is that provided by C++
1197<code>memory_order_acquire</code> and <code>memory_order_release</code>
atomic operations: Memory operations preceding a release store
should be visible following an acquire load that reads the stored value.
On ARM, this is
1200enforced by:</p>
1201
1202<ul>
1203<li>Preceding the store instruction with a suitable fence instruction.
1204This prevents all prior memory accesses from being reordered with the
1205store instruction. (It also unnecessarily prevents reordering with
later store instructions. ARMv8 provides an alternative that doesn't share
1207this problem.)
1208<li>Following the load instruction with a suitable fence instruction,
1209preventing the load from being reordered with subsequent accesses.
1210(And once again providing unneeded ordering with at least earlier loads.)
1211</ul>
1212
1213<p> Together these suffice for C++ acquire/release ordering.
1214They are necessary, but not sufficient, for Java <code>volatile</code>
1215or C++ sequentially consistent <code>atomic</code>.</p>
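
<p>To make this concrete, the following sketch shows C++ release/acquire
operations, with comments giving one plausible ARM instruction mapping;
the exact sequences are chosen by the compiler, so treat them as an
approximation rather than a guaranteed mapping:</p>

<pre>
#include &lt;atomic&gt;

std::atomic&lt;int&gt; a{0};

void releaseStore(int v) {
  // ARMv7, roughly:  dmb ish ; str
  // ARMv8, roughly:  stlr
  a.store(v, std::memory_order_release);
}

int acquireLoad() {
  // ARMv7, roughly:  ldr ; dmb ish
  // ARMv8, roughly:  ldar
  return a.load(std::memory_order_acquire);
}
</pre>

<p>The fence (or the ARMv8 load-acquire/store-release instruction) is what
orders earlier accesses before the store and later accesses after the load.</p>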
1216
1217<p>To see what else we need, consider the fragment of Dekker’s algorithm
1218we briefly mentioned earlier.
1219<code>flag1</code> and <code>flag2</code> are C++ <code>atomic</code>
1220or Java <code>volatile</code> variables, both initially false.</p>
1222<table>
1223<tr>
1224<th>Thread 1</th>
1225<th>Thread 2</th>
1226</tr>
1227<tr>
1228<td><code>flag1 = true<br />
1229if (flag2 == false)<br />
1230&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
1231<td><code>flag2 = true<br />
1232if (flag1 == false)<br />
1233&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
1234</tr>
Hans Boehm9e629ad2015-09-14 13:50:00 -07001235</table>
Hans Boehm9e629ad2015-09-14 13:50:00 -07001237<p>Sequential consistency implies that one of the assignments to
1238<code>flag</code><i>n</i> must be executed first, and be seen by the
1239test in the other thread. Thus, we will never see
Dirk Doughertyd5894212012-11-28 18:53:10 -08001240these threads executing the “critical-stuff” simultaneously.</p>
1241
<p>But the fencing required for acquire-release ordering only adds
fences at the beginning and end of each thread, which doesn't help
here. We additionally need to ensure that if a
<code>volatile</code>/<code>atomic</code> store is followed by
a <code>volatile</code>/<code>atomic</code> load, the two are not reordered.
This is normally enforced by adding a fence not just before a
sequentially consistent store, but also after it.
1249(This is again much stronger than required, since this fence typically orders
1250all earlier memory accesses with respect to all later ones. Again ARMv8
1251offers a more targeted solution.)</p>
1252
1253<p>We could instead associate the extra fence with sequentially
1254consistent loads. Since stores are less frequent, the convention
1255we described is more common and used on Android.</p>
1257<p>As we saw in an earlier section, we need to insert a store/load barrier
1258between the two operations. The code executed in the VM for a volatile access
1259will look something like this:</p>
1260
1261<table>
1262<tr>
1263<th>volatile load</th>
1264<th>volatile store</th>
1265</tr>
1266<tr>
1267<td><code>reg = A<br />
Hans Boehm9e629ad2015-09-14 13:50:00 -07001268<em>fence for "acquire" (1)</em></code></td>
1269<td><code><em>fence for "release" (2)</em><br />
Dirk Doughertyd5894212012-11-28 18:53:10 -08001270A = reg<br />
Hans Boehm9e629ad2015-09-14 13:50:00 -07001271<em>fence for later atomic load (3)</em></code></td>
Dirk Doughertyd5894212012-11-28 18:53:10 -08001272</tr>
1273</table>
1274
Hans Boehm9e629ad2015-09-14 13:50:00 -07001275<p>Real machine architectures commonly provide multiple types of
1276fences, which order different types of accesses and may have
1277different cost. The choice between these is subtle, and influenced
1278by the need to ensure that stores are made visible to other cores in
1279a consistent order, and that the memory ordering imposed by the
1280combination of multiple fences composes correctly. For more details,
1281please see the University of Cambridge page with <a href="#more">
1282collected mappings of atomics to actual processors</a>.</p>
Hans Boehm9e629ad2015-09-14 13:50:00 -07001284<p>On some architectures, notably x86, the "acquire" and "release"
1285barriers are unnecessary, since the hardware always implicitly
1286enforces sufficient ordering. Thus on x86 only the last fence (3)
1287is really generated. Similarly on x86, atomic read-modify-write
1288operations implicitly include a strong fence. Thus these never
1289require any fences. On ARM all fences we discussed above are
1290required.</p>
1292<h3 id="more">Further reading</h3>
1293
1294<p>Web pages and documents that provide greater depth or breadth. The more generally useful articles are nearer the top of the list.</p>
1295
1296<dl>
1297<dt>Shared Memory Consistency Models: A Tutorial</dt>
Hans Boehm9e629ad2015-09-14 13:50:00 -07001298<dd>Written in 1995 by Adve &amp; Gharachorloo, this is a good place to start if you want to dive more deeply into memory consistency models.
Dirk Doughertyd5894212012-11-28 18:53:10 -08001299<br /><a href="http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf">http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf</a></dd>
1300
1301<dt>Memory Barriers</dt>
1302<dd>Nice little article summarizing the issues.
1303<br /><a href="http://en.wikipedia.org/wiki/Memory_barrier">http://en.wikipedia.org/wiki/Memory_barrier</a></dd>
1304
1305<dt>Threads Basics</dt>
1306<dd>An introduction to multi-threaded programming in C++ and Java, by Hans Boehm. Excellent discussion of data races and basic synchronization methods.
<br /><a href="http://www.hboehm.info/c++mm/threadsintro.html">http://www.hboehm.info/c++mm/threadsintro.html</a></dd>
1309<dt>Java Concurrency In Practice</dt>
1310<dd>Published in 2006, this book covers a wide range of topics in great detail. Highly recommended for anyone writing multi-threaded code in Java.
1311<br /><a href="http://www.javaconcurrencyinpractice.com">http://www.javaconcurrencyinpractice.com</a></dd>
1312
1313<dt>JSR-133 (Java Memory Model) FAQ</dt>
1314<dd>A gentle introduction to the Java memory model, including an explanation of synchronization, volatile variables, and construction of final fields.
Hans Boehm9e629ad2015-09-14 13:50:00 -07001315(A bit dated, particularly when it discusses other languages.)
Dirk Doughertyd5894212012-11-28 18:53:10 -08001316<br /><a href="http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html">http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html</a></dd>
1317
Hans Boehm9e629ad2015-09-14 13:50:00 -07001318<dt>Validity of Program Transformations in the Java Memory Model</dt>
1319<dd>A rather technical explanation of remaining problems with the
1320Java memory model. These issues do not apply to data-race-free
1321programs.
1322<br /><a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.112.1790&type=pdf">http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.112.1790&type=pdf</a></dd>
1323
Dirk Doughertyd5894212012-11-28 18:53:10 -08001324<dt>Overview of package java.util.concurrent</dt>
1325<dd>The documentation for the <code>java.util.concurrent</code> package. Near the bottom of the page is a section entitled “Memory Consistency Properties” that explains the guarantees made by the various classes.
1326<br />{@link java.util.concurrent java.util.concurrent} Package Summary</dd>
1327
1328<dt>Java Theory and Practice: Safe Construction Techniques in Java</dt>
1329<dd>This article examines in detail the perils of references escaping during object construction, and provides guidelines for thread-safe constructors.
1330<br /><a href="http://www.ibm.com/developerworks/java/library/j-jtp0618.html">http://www.ibm.com/developerworks/java/library/j-jtp0618.html</a></dd>
1331
1332<dt>Java Theory and Practice: Managing Volatility</dt>
1333<dd>A nice article describing what you can and can’t accomplish with volatile fields in Java.
1334<br /><a href="http://www.ibm.com/developerworks/java/library/j-jtp06197.html">http://www.ibm.com/developerworks/java/library/j-jtp06197.html</a></dd>
1335
1336<dt>The “Double-Checked Locking is Broken” Declaration</dt>
Hans Boehm9e629ad2015-09-14 13:50:00 -07001337<dd>Bill Pugh’s detailed explanation of the various ways in which double-checked locking is broken without <code>volatile</code> or <code>atomic</code>.
1338Includes C/C++ and Java.
Dirk Doughertyd5894212012-11-28 18:53:10 -08001339<br /><a href="http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html">http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html</a></dd>
1340
1341<dt>[ARM] Barrier Litmus Tests and Cookbook</dt>
Hans Boehm9e629ad2015-09-14 13:50:00 -07001342<dd>A discussion of ARM SMP issues, illuminated with short snippets of ARM code. If you found the examples in this document too un-specific, or want to read the formal description of the DMB instruction, read this. Also describes the instructions used for memory barriers on executable code (possibly useful if you’re generating code on the fly). Note that this predates ARMv8, which also
1343supports an additional set of memory ordering instructions.
Dirk Doughertyd5894212012-11-28 18:53:10 -08001344<br /><a href="http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf">http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf</a></dd>
1345
<dt>Linux Kernel Memory Barriers</dt>
1347<dd>Documentation for Linux kernel memory barriers. Includes some useful examples and ASCII art.
1348<br/><a href="http://www.kernel.org/doc/Documentation/memory-barriers.txt">http://www.kernel.org/doc/Documentation/memory-barriers.txt</a></dd>
1349
Hans Boehm9e629ad2015-09-14 13:50:00 -07001350<dt>ISO/IEC JTC1 SC22 WG21 (C++ standards) 14882 (C++ programming language), section 1.10 and clause 29 (“Atomic operations library”)</dt>
1351<dd>Draft standard for C++ atomic operation features. This version is
1352close to the C++14 standard, which includes minor changes in this area
1353from C++11.
1354<br /><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4527.pdf">http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4527.pdf</a>
1355<br />(intro: <a href="http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf">http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf</a>)</dd>
1357<dt>ISO/IEC JTC1 SC22 WG14 (C standards) 9899 (C programming language) chapter 7.16 (“Atomics &lt;stdatomic.h&gt;”)</dt>
Hans Boehm9e629ad2015-09-14 13:50:00 -07001358<dd>Draft standard for ISO/IEC 9899-201x C atomic operation features.
1359For details, also check later defect reports.
1360<br /><a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf">http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf</a></dd>
1361
1362<dt>C/C++11 mappings to processors (University of Cambridge)</dt>
1363<dd>Jaroslav Sevcik and Peter Sewell's collection of translations
1364of C++ atomics to various common processor instruction sets.
1365<br /><a href="http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html">
1366http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html</a></dd>
1368<dt>Dekker’s algorithm</dt>
1369<dd>The “first known correct solution to the mutual exclusion problem in concurrent programming”. The wikipedia article has the full algorithm, with a discussion about how it would need to be updated to work with modern optimizing compilers and SMP hardware.
1370<br /><a href="http://en.wikipedia.org/wiki/Dekker's_algorithm">http://en.wikipedia.org/wiki/Dekker's_algorithm</a></dd>
1371
1372<dt>Comments on ARM vs. Alpha and address dependencies</dt>
1373<dd>An e-mail on the arm-kernel mailing list from Catalin Marinas. Includes a nice summary of address and control dependencies.
1374<br /><a href="http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-05/msg11811.html">http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-05/msg11811.html</a></dd>
1375
1376<dt>What Every Programmer Should Know About Memory</dt>
1377<dd>A very long and detailed article about different types of memory, particularly CPU caches, by Ulrich Drepper.
1378<br /><a href="http://www.akkadia.org/drepper/cpumemory.pdf">http://www.akkadia.org/drepper/cpumemory.pdf</a></dd>
1379
1380<dt>Reasoning about the ARM weakly consistent memory model</dt>
<dd>This paper was written by Chong &amp; Ishtiaq of ARM, Ltd. It attempts to describe the ARM SMP memory model in a rigorous but accessible fashion. The definition of “observability” used here comes from this paper.
1382<br /><a href="http://portal.acm.org/ft_gateway.cfm?id=1353528&type=pdf&coll=&dl=&CFID=96099715&CFTOKEN=57505711">http://portal.acm.org/ft_gateway.cfm?id=1353528&type=pdf&coll=&dl=&CFID=96099715&CFTOKEN=57505711</a></dd>
1383
1384<dt>The JSR-133 Cookbook for Compiler Writers</dt>
Hans Boehm9e629ad2015-09-14 13:50:00 -07001385<dd>Doug Lea wrote this as a companion to the JSR-133 (Java Memory Model) documentation. It contains the initial set of implementation guidelines
1386for the Java memory model that was used by many compiler writers, and is
1387still widely cited and likely to provide insight.
1388Unfortunately, the four fence varieties discussed here are not a good
1389match for Android-supported architectures, and the above C++11 mappings
1390are now a better source of precise recipes, even for Java.
Dirk Doughertyd5894212012-11-28 18:53:10 -08001391<br /><a href="http://g.oswego.edu/dl/jmm/cookbook.html">http://g.oswego.edu/dl/jmm/cookbook.html</a></dd>
1392
Hans Boehm9e629ad2015-09-14 13:50:00 -07001393<dt>x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors</dt>
1394<dd>A precise description of the x86 memory model. Precise descriptions of
1395the ARM memory model are unfortunately significantly more complicated.
1396<br /><a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf">http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf</a></dd>
Dirk Doughertyd5894212012-11-28 18:53:10 -08001397</dl>