page.title=SMP Primer for Android
page.tags=ndk,native

page.article=true
@jd:body

<div id="tb-wrapper">
<div id="tb">
<h2>In this document</h2>
<ol class="nolist">
  <li><a href="#theory">Theory</a>
    <ol class="nolist">
    <li style="margin: 3px 0 0"><a href="#mem_consistency">Memory consistency models</a>
      <ol class="nolist">
      <li style="margin:0"><a href="#proc_consistency">Processor consistency</a></li>
      <li style="margin:0"><a href="#cpu_cache">CPU cache behavior</a></li>
      <li style="margin:0"><a href="#observability">Observability</a></li>
      <li style="margin:0"><a href="#ordering">ARM’s weak ordering</a></li>
      </ol>
    </li>
    <li style="margin:3px 0 0"><a href="#datamem_barriers">Data memory barriers</a>
      <ol class="nolist">
      <li style="margin:0"><a href="#ss_ll">Store/store and load/load</a></li>
      <li style="margin:0"><a href="#ls_sl">Load/store and store/load</a></li>
      <li style="margin:0"><a href="#barrier_inst">Barrier instructions</a></li>
      <li style="margin:0"><a href="#addr_dep">Address dependencies and causal consistency</a></li>
      <li style="margin:0"><a href="#membarrier_summary">Memory barrier summary</a></li>
      </ol>
    </li>
    <li style="margin:3px 0 0"><a href="#atomic_ops">Atomic operations</a>
      <ol class="nolist">
      <li style="margin:0"><a href="#atomic_essentials">Atomic essentials</a></li>
      <li style="margin:0"><a href="#atomic_barrierpairing">Atomic + barrier pairing</a></li>
      <li style="margin:0"><a href="#acq_rel">Acquire and release</a></li>
      </ol>
    </li>
    </ol>
  </li>
  <li><a href="#practice">Practice</a>
    <ol class="nolist">
    <li style="margin:3px 0 0"><a href="#c_dont">What not to do in C</a>
      <ol class="nolist">
      <li style="margin:0"><a href="#volatile">C/C++ and “volatile”</a></li>
      <li style="margin:0"><a href="#examplesc">Examples</a></li>
      </ol>
    </li>
    <li style="margin:3px 0 0"><a href="#j_dont">What not to do in Java</a>
      <ol class="nolist">
      <li style="margin:0"><a href="#sync_volatile">“synchronized” and “volatile”</a></li>
      <li style="margin:0"><a href="#examplesj">Examples</a></li>
      </ol>
    </li>
    <li style="margin:3px 0 0"><a href="#bestpractice">What to do</a>
      <ol class="nolist">
      <li style="margin:0"><a href="#advice">General advice</a></li>
      <li style="margin:0"><a href="#sync_guarantees">Synchronization primitive guarantees</a></li>
      <li style="margin:0"><a href="#ccpp_changes">Upcoming changes to C/C++</a></li>
      </ol>
    </li>
    </ol>
  </li>
  <li><a href="#closing_notes">Closing Notes</a></li>
  <li><a href="#appendix">Appendix</a>
    <ol class="nolist">
    <li style="margin:0"><a href="#smp_failure_example">SMP failure example</a></li>
    <li style="margin:0"><a href="#sync_stores">Implementing synchronization stores</a></li>
    <li style="margin:0"><a href="#more">Further reading</a></li>
    </ol>
  </li>
</ol>
</div>
</div>

<p>Android 3.0 and later platform versions are optimized to support
multiprocessor architectures. This document introduces issues that
can arise when writing code for symmetric multiprocessor systems in C, C++, and the Java
programming language (hereafter referred to simply as “Java” for the sake of
brevity). It's intended as a primer for Android app developers, not as a complete
discussion on the subject. The focus is on the ARM CPU architecture.</p>

<p>If you’re in a hurry, you can skip the <a href="#theory">Theory</a> section
and go directly to <a href="#practice">Practice</a> for best practices, but this
is not recommended.</p>


<h2 id="intro">Introduction</h2>

<p>SMP is an acronym for “Symmetric Multi-Processor”. It describes a design in
which two or more identical CPU cores share access to main memory. Until
a few years ago, all Android devices were UP (Uni-Processor).</p>

<p>Most &mdash; if not all &mdash; Android devices do have multiple CPUs, but generally one
of them is used to run applications while others manage various bits of device
hardware (for example, the radio). The CPUs may have different architectures, and the
programs running on them can’t use main memory to communicate with each
other.</p>

<p>Most Android devices sold today are built around SMP designs,
making things a bit more complicated for software developers. The sorts of race
conditions you might encounter in a multi-threaded program are much worse on SMP
when two or more of your threads are running simultaneously on different cores.
What’s more, SMP on ARM is more challenging to work with than SMP on x86. Code
that has been thoroughly tested on x86 may break badly on ARM.</p>

<p>The rest of this document will explain why, and tell you what you need to do
to ensure that your code behaves correctly.</p>


<h2 id="theory">Theory</h2>

<p>This is a high-speed, glossy overview of a complex subject. Some areas will
be incomplete, but none of it should be misleading or wrong.</p>

<p>See <a href="#more">Further reading</a> at the end of the document for
pointers to more thorough treatments of the subject.</p>

<h3 id="mem_consistency">Memory consistency models</h3>

<p>Memory consistency models, or often just “memory models”, describe the
guarantees the hardware architecture makes about memory accesses. For example,
if you write a value to address A, and then write a value to address B, the
model might guarantee that every CPU core sees those writes happen in that
order.</p>

<p>The model most programmers are accustomed to is <em>sequential
consistency</em>, which is described like this <span
style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Adve &
Gharachorloo</a>)</span>:</p>

<ul>
<li>All memory operations appear to execute one at a time.</li>
<li>All operations on a single processor appear to execute in the order described
by that processor's program.</li>
</ul>

<p>If you look at a bit of code and see that it does some reads and writes from
memory, on a sequentially-consistent CPU architecture you know that the code
will do those reads and writes in the expected order. It’s possible that the
CPU is actually reordering instructions and delaying reads and writes, but there
is no way for code running on the device to tell that the CPU is doing anything
other than execute instructions in a straightforward manner. (We’re ignoring
memory-mapped device driver I/O for the moment.)</p>

<p>To illustrate these points it’s useful to consider small snippets of code,
commonly referred to as <em>litmus tests</em>. These are assumed to execute in
<em>program order</em>, that is, the order in which the instructions appear here is
the order in which the CPU will execute them. We don’t want to consider
instruction reordering performed by compilers just yet.</p>

<p>Here’s a simple example, with code running on two threads:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>A = 3<br />
B = 5</code></td>
<td><code>reg0 = B<br />
reg1 = A</code></td>
</tr>
</table>

<p>In this and all future litmus examples, memory locations are represented by
capital letters (A, B, C) and CPU registers start with “reg”. All memory is
initially zero. Instructions are executed from top to bottom. Here, thread 1
stores the value 3 at location A, and then the value 5 at location B. Thread 2
loads the value from location B into reg0, and then loads the value from
location A into reg1. (Note that we’re writing in one order and reading in
another.)</p>

<p>Thread 1 and thread 2 are assumed to execute on different CPU cores. You
should <strong>always</strong> make this assumption when thinking about
multi-threaded code.</p>

<p>Sequential consistency guarantees that, after both threads have finished
executing, the registers will be in one of the following states:</p>


<table>
<tr>
<th>Registers</th>
<th>States</th>
</tr>
<tr>
<td>reg0=5, reg1=3</td>
<td>possible (thread 1 ran first)</td>
</tr>
<tr>
<td>reg0=0, reg1=0</td>
<td>possible (thread 2 ran first)</td>
</tr>
<tr>
<td>reg0=0, reg1=3</td>
<td>possible (concurrent execution)</td>
</tr>
<tr>
<td>reg0=5, reg1=0</td>
<td>never</td>
</tr>
</table>

<p>To get into a situation where we see B=5 before we see the store to A, either
the reads or the writes would have to happen out of order. On a
sequentially-consistent machine, that can’t happen.</p>

<p>Most uni-processors, including x86 and ARM, are sequentially consistent.
Most SMP systems, including x86 and ARM, are not.</p>

211<h4 id="proc_consistency">Processor consistency</h4>
212
213<p>x86 SMP provides <em>processor consistency</em>, which is slightly weaker than
214sequential. While the architecture guarantees that loads are not reordered with
215respect to other loads, and stores are not reordered with respect to other
216stores, it does not guarantee that a store followed by a load will be observed
217in the expected order.</p>
218
219<p>Consider the following example, which is a piece of Dekker’s Algorithm for
220mutual exclusion:</p>
221
222<table>
223<tr>
224<th>Thread 1</th>
225<th>Thread 2</th>
226</tr>
227<tr>
228<td><code>A = true<br />
229reg1 = B<br />
230if (reg1 == false)<br />
231&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
232<td><code>B = true<br />
233reg2 = A<br />
234if (reg2 == false)<br />
235&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
236</tr>
237</table>
238
239<p>The idea is that thread 1 uses A to indicate that it’s busy, and thread 2
240uses B. Thread 1 sets A and then checks to see if B is set; if not, it can
241safely assume that it has exclusive access to the critical section. Thread 2
242does something similar. (If a thread discovers that both A and B are set, a
243turn-taking algorithm is used to ensure fairness.)</p>
244
245<p>On a sequentially-consistent machine, this works correctly. On x86 and ARM
246SMP, the store to A and the load from B in thread 1 can be “observed” in a
247different order by thread 2. If that happened, we could actually appear to
248execute this sequence (where blank lines have been inserted to highlight the
249apparent order of operations):</p>
250
251<table>
252<tr>
253<th>Thread 1</th>
254<th>Thread 2</th>
255</tr>
256<tr>
257<td><code>reg1 = B<br />
258<br />
259<br />
260A = true<br />
261if (reg1 == false)<br />
262&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
263
264<td><code><br />
265B = true<br />
266reg2 = A<br />
267<br />
268if (reg2 == false)<br />
269&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
270</tr>
271</table>
272
273<p>This results in both reg1 and reg2 set to “false”, allowing the threads to
274execute code in the critical section simultaneously. To understand how this can
275happen, it’s useful to know a little about CPU caches.</p>
276
277<h4 id="cpu_cache">CPU cache behavior</h4>
278
279<p>This is a substantial topic in and of itself. An extremely brief overview
280follows. (The motivation for this material is to provide some basis for
281understanding why SMP systems behave as they do.)</p>
282
283<p>Modern CPUs have one or more caches between the processor and main memory.
284These are labeled L1, L2, and so on, with the higher numbers being successively
285“farther” from the CPU. Cache memory adds size and cost to the hardware, and
286increases power consumption, so the ARM CPUs used in Android devices typically
287have small L1 caches and little or no L2/L3.</p>
288
289<p>Loading or storing a value into the L1 cache is very fast. Doing the same to
290main memory can be 10-100x slower. The CPU will therefore try to operate out of
291the cache as much as possible. The <em>write policy</em> of a cache determines when data
292written to it is forwarded to main memory. A <em>write-through</em> cache will initiate
293a write to memory immediately, while a <em>write-back</em> cache will wait until it runs
294out of space and has to evict some entries. In either case, the CPU will
295continue executing instructions past the one that did the store, possibly
296executing dozens of them before the write is visible in main memory. (While the
297write-through cache has a policy of immediately forwarding the data to main
298memory, it only <strong>initiates</strong> the write. It does not have to wait
299for it to finish.)</p>
300
301<p>The cache behavior becomes relevant to this discussion when each CPU core has
302its own private cache. In a simple model, the caches have no way to interact
303with each other directly. The values held by core #1’s cache are not shared
304with or visible to core #2’s cache except as loads or stores from main memory.
305The long latencies on memory accesses would make inter-thread interactions
306sluggish, so it’s useful to define a way for the caches to share data. This
307sharing is called <em>cache coherency</em>, and the coherency rules are defined
308by the CPU architecture’s <em>cache consistency model</em>.</p>
309
310<p>With that in mind, let’s return to the Dekker example. When core 1 executes
311“A = 1”, the value gets stored in core 1’s cache. When core 2 executes “if (A
312== 0)”, it might read from main memory or it might read from core 2’s cache;
313either way it won’t see the store performed by core 1. (“A” could be in core
3142’s cache because of a previous load from “A”.)</p>
315
316<p>For the memory consistency model to be sequentially consistent, core 1 would
317have to wait for all other cores to be aware of “A = 1” before it could execute
318“if (B == 0)” (either through strict cache coherency rules, or by disabling the
319caches entirely so everything operates out of main memory). This would impose a
320performance penalty on every store operation. Relaxing the rules for the
321ordering of stores followed by loads improves performance but imposes a burden
322on software developers.</p>
323
324<p>The other guarantees made by the processor consistency model are less
325expensive to make. For example, to ensure that memory writes are not observed
326out of order, it just needs to ensure that the stores are published to other
327cores in the same order that they were issued. It doesn’t need to wait for
328store #1 to <strong>finish</strong> being published before it can start on store
329#2, it just needs to ensure that it doesn’t finish publishing #2 before it
330finishes publishing #1. This avoids a performance bubble.</p>
331
332<p>Relaxing the guarantees even further can provide additional opportunities for
333CPU optimization, but creates more opportunities for code to behave in ways the
334programmer didn’t expect.</p>
335
336<p>One additional note: CPU caches don’t operate on individual bytes. Data is
337read or written as <em>cache lines</em>; for many ARM CPUs these are 32 bytes. If you
338read data from a location in main memory, you will also be reading some adjacent
339values. Writing data will cause the cache line to be read from memory and
340updated. As a result, you can cause a value to be loaded into cache as a
341side-effect of reading or writing something nearby, adding to the general aura
342of mystery.</p>
343
344<h4 id="observability">Observability</h4>
345
346<p>Before going further, it’s useful to define in a more rigorous fashion what
347is meant by “observing” a load or store. Suppose core 1 executes “A = 1”. The
348store is <em>initiated</em> when the CPU executes the instruction. At some
349point later, possibly through cache coherence activity, the store is
350<em>observed</em> by core 2. In a write-through cache it doesn’t really
351<em>complete</em> until the store arrives in main memory, but the memory
352consistency model doesn’t dictate when something completes, just when it can be
353<em>observed</em>.</p>
354
355
356<p>(In a kernel device driver that accesses memory-mapped I/O locations, it may
357be very important to know when things actually complete. We’re not going to go
358into that here.)</p>
359
360<p>Observability may be defined as follows:</p>
361
362<ul>
363<li>"A write to a location in memory is said to be observed by an observer Pn
364when a subsequent read of the location by Pn would return the value written by
365the write."</li>
366<li>"A read of a location in memory is said to be observed by an observer Pm
367when a subsequent write to the location by Pm would have no effect on the value
368returned by the read." <span style="font-size:.9em;color:#777">(<em><a
369href="#more" style="color:#777">Reasoning about the ARM weakly consistent memory
370model</a></em>)</span></li>
371</ul>
372
373
374<p>A less formal way to describe it (where “you” and “I” are CPU cores) would be:</p>
375
376<ul>
377<li>I have observed your write when I can read what you wrote</li>
378<li>I have observed your read when I can no longer affect the value you read</li>
379</ul>
380
381<p>The notion of observing a write is intuitive; observing a read is a bit less
382so (don’t worry, it grows on you).</p>
383
384<p>With this in mind, we’re ready to talk about ARM.</p>
385
386<h4 id="ordering">ARM's weak ordering</h4>
387
388<p>ARM SMP provides weak memory consistency guarantees. It does not guarantee that
389loads or stores are ordered with respect to each other.</p>
390
391<table>
392<tr>
393<th>Thread 1</th>
394<th>Thread 2</th>
395</tr>
396<tr>
397<td><code>A = 41<br />
398B = 1 // “A is ready”</code></td>
399<td><code>loop_until (B == 1)<br />
400reg = A</code></td>
401</tr>
402</table>
403
404<p>Recall that all addresses are initially zero. The “loop_until” instruction
405reads B repeatedly, looping until we read 1 from B. The idea here is that
406thread 2 is waiting for thread 1 to update A. Thread 1 sets A, and then sets B
407to 1 to indicate data availability.</p>
408
409<p>On x86 SMP, this is guaranteed to work. Thread 2 will observe the stores
410made by thread 1 in program order, and thread 1 will observe thread 2’s loads in
411program order.</p>
412
413<p>On ARM SMP, the loads and stores can be observed in any order. It is
414possible, after all the code has executed, for reg to hold 0. It’s also
415possible for it to hold 41. Unless you explicitly define the ordering, you
416don’t know how this will come out.</p>
417
418<p>(For those with experience on other systems, ARM’s memory model is equivalent
419to PowerPC in most respects.)</p>
420
421
422<h3 id="datamem_barriers">Data memory barriers</h3>
423
424<p>Memory barriers provide a way for your code to tell the CPU that memory
425access ordering matters. ARM/x86 uniprocessors offer sequential consistency,
426and thus have no need for them. (The barrier instructions can be executed but
427aren’t useful; in at least one case they’re hideously expensive, motivating
428separate builds for SMP targets.)</p>
429
430<p>There are four basic situations to consider:</p>
431
432<ol>
433<li>store followed by another store</li>
434<li>load followed by another load</li>
435<li>load followed by store</li>
436<li>store followed by load</li>
437</ol>
438
439<h4 id="ss_ll">Store/store and load/load</h4>
440
441<p>Recall our earlier example:</p>
442
443<table>
444<tr>
445<th>Thread 1</th>
446<th>Thread 2</th>
447</tr>
448<tr>
449<td><code>A = 41<br />
450B = 1 // “A is ready”</code></td>
451<td><code>loop_until (B == 1)<br />
452reg = A</code></td>
453</tr>
454</table>
455
456
457<p>Thread 1 needs to ensure that the store to A happens before the store to B.
458This is a “store/store” situation. Similarly, thread 2 needs to ensure that the
459load of B happens before the load of A; this is a load/load situation. As
460mentioned earlier, the loads and stores can be observed in any order.</p>
461
462<div style="padding:.5em 2em;">
463<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
464<p>Going back to the cache discussion, assume A and B are on separate cache
465lines, with minimal cache coherency. If the store to A stays local but the
466store to B is published, core 2 will see B=1 but won’t see the update to A. On
467the other side, assume we read A earlier, or it lives on the same cache line as
468something else we recently read. Core 2 spins until it sees the update to B,
469then loads A from its local cache, where the value is still zero.</p>
470</div>
471</div>
472
473<p>We can fix it like this:</p>
474
475<table>
476<tr>
477<th>Thread 1</th>
478<th>Thread 2</th>
479</tr>
480<tr>
481<td><code>A = 41<br />
482<em>store/store barrier</em><br />
483B = 1 // “A is ready”</code></td>
484<td><code>loop_until (B == 1)<br />
485<em>load/load barrier</em><br />
486reg = A</code></td>
487</tr>
488</table>
489
490<p>The store/store barrier guarantees that <strong>all observers</strong> will
491observe the write to A before they observe the write to B. It makes no
492guarantees about the ordering of loads in thread 1, but we don’t have any of
493those, so that’s okay. The load/load barrier in thread 2 makes a similar
494guarantee for the loads there.</p>
495
496<p>Since the store/store barrier guarantees that thread 2 observes the stores in
497program order, why do we need the load/load barrier in thread 2? Because we
498also need to guarantee that thread 1 observes the loads in program order.</p>
499
500<div style="padding:.5em 2em;">
501<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
502<p>The store/store barrier could work by flushing all
503dirty entries out of the local cache, ensuring that other cores see them before
504they see any future stores. The load/load barrier could purge the local cache
505completely and wait for any “in-flight” loads to finish, ensuring that future
506loads are observed after previous loads. What the CPU actually does doesn’t
507matter, so long as the appropriate guarantees are kept. If we use a barrier in
508core 1 but not in core 2, core 2 could still be reading A from its local
509cache.</p>
510</div>
511</div>
512
513<p>Because the architectures have different memory models, these barriers are
514required on ARM SMP but not x86 SMP.</p>
515
516<h4 id="ls_sl">Load/store and store/load</h4>
517
518<p>The Dekker’s Algorithm fragment shown earlier illustrated the need for a
519store/load barrier. Here’s an example where a load/store barrier is
520required:</p>
521
522<table>
523<tr>
524<th>Thread 1</th>
525<th>Thread 2</th>
526</tr>
527<tr>
528<td><code>reg = A<br />
529B = 1 // “I have latched A”</code></td>
530<td><code>loop_until (B == 1)<br />
531A = 41 // update A</code></td>
532</tr>
533</table>
534
535<p>Thread 2 could observe thread 1’s store of B=1 before it observe’s thread 1’s
536load from A, and as a result store A=41 before thread 1 has a chance to read A.
537Inserting a load/store barrier in each thread solves the problem:</p>
538
539<table>
540<tr>
541<th>Thread 1</th>
542<th>Thread 2</th>
543</tr>
544<tr>
545<td><code>reg = A<br />
546<em>load/store barrier</em><br />
547B = 1 // “I have latched A”</code></td>
548<td><code>loop_until (B == 1)<br />
549<em>load/store barrier</em><br />
550A = 41 // update A</code></td>
551</tr>
552</table>
553
554<div style="padding:.5em 2em;">
555<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
556<p>A store to local cache may be observed before a load from main memory,
557because accesses to main memory are so much slower. In this case, assume core
5581’s cache has the cache line for B but not A. The load from A is initiated, and
559while that’s in progress execution continues. The store to B happens in local
560cache, and by some means becomes available to core 2 while the load from A is
561still in progress. Thread 2 is able to exit the loop before it has observed
562thread 1’s load from A.</p>
563
564<p>A thornier question is: do we need a barrier in thread 2? If the CPU doesn’t
565perform speculative writes, and doesn’t execute instructions out of order, can
566thread 2 store to A before thread 1’s read if thread 1 guarantees the load/store
567ordering? (Answer: no.) What if there’s a third core watching A and B?
568(Answer: now you need one, or you could observe B==0 / A==41 on the third core.)
569 It’s safest to insert barriers in both places and not worry about the
570details.</p>
571</div>
572</div>
573
574<p>As mentioned earlier, store/load barriers are the only kind required on x86
575SMP.</p>
576
577<h4 id="barrier_inst">Barrier instructions</h4>
578
579<p>Different CPUs provide different flavors of barrier instruction. For
580example:</p>
581
582<ul>
583<li>Sparc V8 has a “membar” instruction that takes a 4-element bit vector. The
584four categories of barrier can be specified individually.</li>
585<li>Alpha provides “rmb” (load/load), “wmb” (store/store), and “mb” (full).
586(Trivia: the linux kernel provides three memory barrier functions with these
587names and behaviors.)</li>
588<li>x86 has a variety of options; “mfence” (introduced with SSE2) provides a
589full barrier.</li>
590<li>ARMv7 has “dmb st” (store/store) and “dmb sy” (full).</li>
591</ul>
592
593<p>“Full barrier” means all four categories are included.</p>
594
595<p>It is important to recognize that the only thing guaranteed by barrier
596instructions is ordering. Do not treat them as cache coherency “sync points” or
597synchronous “flush” instructions. The ARM “dmb” instruction has no direct
598effect on other cores. This is important to understand when trying to figure
599out where barrier instructions need to be issued.</p>
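
<p>As a rough illustration, here is how code typically reaches these
instructions from C. This is a hedged sketch assuming a GCC-style ARMv7
toolchain; the macro names are invented for this example, and the
<code>"memory"</code> clobber doubles as the compiler reorder barrier discussed
later in this document:</p>

<pre>/* hypothetical helpers, assuming GCC-style inline assembly on ARMv7 */
#define store_store_barrier() __asm__ __volatile__("dmb st" ::: "memory")
#define full_barrier()        __asm__ __volatile__("dmb sy" ::: "memory")</pre>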


<h4 id="addr_dep">Address dependencies and causal consistency</h4>

<p><em>(This is a slightly more advanced topic and can be skipped.)</em></p>

<p>The ARM CPU provides one special case where a load/load barrier can be
avoided. Consider the following example from earlier, modified slightly:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>[A+8] = 41<br />
<em>store/store barrier</em><br />
B = 1 // “A is ready”</code></td>
<td><code>loop:<br />
&nbsp;&nbsp;&nbsp;&nbsp;reg0 = B<br />
&nbsp;&nbsp;&nbsp;&nbsp;if (reg0 == 0) goto loop<br />
reg1 = 8<br />
reg2 = [A + reg1]</code></td>
</tr>
</table>

<p>This introduces a new notation. If “A” refers to a memory address, “A+n”
refers to a memory address offset by n bytes from A. If A is the base address
of an object or array, [A+8] could be a field in the object or an element in the
array.</p>

631<p>The “loop_until” seen in previous examples has been expanded to show the load
632of B into reg0. reg1 is assigned the numeric value 8, and reg2 is loaded from
Andy McFadden9b6687b2013-01-15 12:26:04 -0800633the address [A+reg1] (the same location that thread 1 is accessing).</p>
Dirk Doughertyd5894212012-11-28 18:53:10 -0800634
635<p>This will not behave correctly because the load from B could be observed
636after the load from [A+reg1]. We can fix this with a load/load barrier after
637the loop, but on ARM we can also just do this:</p>
638
639<table>
640<tr>
641<th>Thread 1</th>
642<th>Thread 2</th>
643</tr>
644<tr>
Andy McFadden9b6687b2013-01-15 12:26:04 -0800645<td><code>[A+8] = 41<br />
Dirk Doughertyd5894212012-11-28 18:53:10 -0800646<em>store/store barrier</em><br />
647B = 1 // “A is ready”</code></td>
648<td><code>loop:<br />
649&nbsp;&nbsp;&nbsp;&nbsp;reg0 = B<br />
650&nbsp;&nbsp;&nbsp;&nbsp;if (reg0 == 0) goto loop<br />
651reg1 = 8 <strong>+ (reg0 & 0)</strong><br />
652reg2 = [A + reg1]</code></td>
653</tr>
654</table>
655
656<p>What we’ve done here is change the assignment of reg1 from a constant (8) to
657a value that depends on what we loaded from B. In this case, we do a bitwise
658AND of the value with 0, which yields zero, which means reg1 still has the value
6598. However, the ARM CPU believes that the load from [A+reg1] depends upon the
660load from B, and will ensure that the two are observed in program order.</p>
661
662<p>This is called an <em>address dependency</em>. Address dependencies exist
663when the value returned by a load is used to compute the address of a subsequent
664load or store. It can let you avoid the need for an explicit barrier in certain
665situations.</p>
666
667<p>ARM does not provide <em>control dependency</em> guarantees. To illustrate
668this it’s necessary to dip into ARM code for a moment: <span
669style="font-size:.9em;color:#777">(<em><a href="#more"
670style="color:#777">Barrier Litmus Tests and Cookbook</a></em>)</span>.</p>
671
672<pre>
673LDR r1, [r0]
674CMP r1, #55
675LDRNE r2, [r3]
676</pre>
677
678<p>The loads from r0 and r3 may be observed out of order, even though the load
679from r3 will not execute at all if [r0] doesn’t hold 55. Inserting AND r1, r1,
680#0 and replacing the last instruction with LDRNE r2, [r3, r1] would ensure
681proper ordering without an explicit barrier. (This is a prime example of why
682you can’t think about consistency issues in terms of instruction execution.
683Always think in terms of memory accesses.)</p>
684
685<p>While we’re hip-deep, it’s worth noting that ARM does not provide <em>causal
686consistency</em>:</p>
687
688<table>
689<tr>
690<th>Thread 1</th>
691<th>Thread 2</th>
692<th>Thread 3</th>
693</tr>
694<tr>
695<td><code>A = 1</code></td>
696<td><code>loop_until (A == 1)<br />
697B = 1</code></td>
698<td><code>loop:<br />
699&nbsp;&nbsp;reg0 = B<br />
700&nbsp;&nbsp;if (reg0 == 0) goto loop<br />
701reg1 = reg0 & 0<br />
702reg2 = [A+reg1]</code></td>
703</tr>
704</table>
705
706<p>Here, thread 1 sets A, signaling thread 2. Thread 2 sees that and sets B to
707signal thread 3. Thread 3 sees it and loads from A, using an address dependency
708to ensure that the load of B and the load of A are observed in program
709order.</p>
710
711<p>It’s possible for reg2 to hold zero at the end of this. The fact that a
712store in thread 1 causes something to happen in thread 2 which causes something
713to happen in thread 3 does not mean that thread 3 will observe the stores in
714that order. (Inserting a load/store barrier in thread 2 fixes this.)</p>
715
716<h4 id="membarrier_summary">Memory barrier summary</h4>
717
718<p>Barriers come in different flavors for different situations. While there can
719be performance advantages to using exactly the right barrier type, there are
720code maintenance risks in doing so &mdash; unless the person updating the code
721fully understands it, they might introduce the wrong type of operation and cause
722a mysterious breakage. Because of this, and because ARM doesn’t provide a wide
723variety of barrier choices, many atomic primitives use full
724barrier instructions when a barrier is required.</p>
725
726<p>The key thing to remember about barriers is that they define ordering. Don’t
727think of them as a “flush” call that causes a series of actions to happen.
728Instead, think of them as a dividing line in time for operations on the current
729CPU core.</p>
730
731
732<h3 id="atomic_ops">Atomic operations</h3>
733
734<p>Atomic operations guarantee that an operation that requires a series of steps
735always behaves as if it were a single operation. For example, consider a
736non-atomic increment (“++A”) executed on the same variable by two threads
737simultaneously:</p>
738
739<table>
740<tr>
741<th>Thread 1</th>
742<th>Thread 2</th>
743</tr>
744<tr>
745<td><code>reg = A<br />
746reg = reg + 1<br />
747A = reg</code></td>
748<td><code>reg = A<br />
749reg = reg + 1<br />
750A = reg</code></td>
751</tr>
752</table>
753
754<p>If the threads execute concurrently from top to bottom, both threads will
755load 0 from A, increment it to 1, and store it back, leaving a final result of
7561. If we used an atomic increment operation, you would be guaranteed that the
757final result will be 2.</p>
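
<p>As a sketch of what using such an operation looks like in C, assuming a
GCC-compatible toolchain (whose <code>__sync</code> builtins also act as full
barriers), the increment becomes a single call:</p>

<pre>static int gCounter = 0;

void incr(void)
{
    /* atomically performs { reg = gCounter; reg = reg + 1; gCounter = reg; }
       as one indivisible read-modify-write step */
    __sync_fetch_and_add(&gCounter, 1);
}</pre>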

<h4 id="atomic_essentials">Atomic essentials</h4>

<p>The most fundamental operations &mdash; loading and storing 32-bit values
&mdash; are inherently atomic on ARM so long as the data is aligned on a 32-bit
boundary. For example:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>reg = 0x00000000<br />
A = reg</code></td>
<td><code>reg = 0xffffffff<br />
A = reg</code></td>
</tr>
</table>

<p>The CPU guarantees that A will hold 0x00000000 or 0xffffffff. It will never
hold 0x0000ffff or any other partial “mix” of bytes.</p>

<div style="padding:.5em 2em;">
<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
<p>The atomicity guarantee is lost if the data isn’t aligned. Misaligned data
could straddle a cache line, so other cores could see the halves update
independently. Consequently, the ARMv7 documentation declares that it provides
“single-copy atomicity” for all byte accesses, halfword accesses to
halfword-aligned locations, and word accesses to word-aligned locations.
Doubleword (64-bit) accesses are <strong>not</strong> atomic, unless the
location is doubleword-aligned and special load/store instructions are used.
This behavior is important to understand when multiple threads are performing
unsynchronized updates to packed structures or arrays of primitive types.</p>
</div>
</div>

<p>There is no need for 32-bit “atomic read” or “atomic write” functions on ARM
or x86. Where one is provided for completeness, it just does a trivial load or
store.</p>

<p>Operations that perform more complex actions on data in memory are
collectively known as <em>read-modify-write</em> (RMW) instructions, because
they load data, modify it in some way, and write it back. CPUs vary widely in
how these are implemented. ARM uses a technique called “Load Linked / Store
Conditional”, or LL/SC.</p>

<div style="padding:.5em 2em;">
<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
<p>A <em>linked</em> or <em>locked</em> load reads the data from memory as
usual, but also establishes a reservation, tagging the physical memory address.
The reservation is cleared when another core tries to write to that address. To
perform an LL/SC, the data is read with a reservation, modified, and then a
conditional store instruction is used to try to write the data back. If the
reservation is still in place, the store succeeds; if not, the store will fail.
Atomic functions based on LL/SC usually loop, retrying the entire
read-modify-write sequence until it completes without interruption.</p>
</div>
</div>

<p>It’s worth noting that the read-modify-write operations would not work
correctly if they operated on stale data. If two cores perform an atomic
increment on the same address, and one of them is not able to see what the other
did because each core is reading and writing from local cache, the operation
won’t actually be atomic. The CPU’s cache coherency rules ensure that the
atomic RMW operations remain atomic in an SMP environment.</p>

<p>This should not be construed to mean that atomic RMW operations use a memory
barrier. On ARM, atomics have no memory barrier semantics. While a series of
atomic RMW operations on a single address will be observed in program order by
other cores, there are no guarantees when it comes to the ordering of atomic and
non-atomic operations.</p>

<p>It often makes sense to pair barriers and atomic operations together. The
next section describes this in more detail.</p>

<h4 id="atomic_barrierpairing">Atomic + barrier pairing</h4>

<p>As usual, it’s useful to illuminate the discussion with an example. We’re
going to consider a basic mutual-exclusion primitive called a <em>spin
lock</em>. The idea is that a memory address (which we’ll call “lock”)
initially holds zero. When a thread wants to execute code in the critical
section, it sets the lock to 1, executes the critical code, and then changes it
back to zero when done. If another thread has already set the lock to 1, we sit
and spin until the lock changes back to zero.</p>

<p>To make this work we use an atomic RMW primitive called
<em>compare-and-swap</em>. The function takes three arguments: the memory
address, the expected current value, and the new value. If the value currently
in memory matches what we expect, it is replaced with the new value, and the old
value is returned. If the current value is not what we expect, we don’t change
anything. A minor variation on this is called <em>compare-and-set</em>; instead
of returning the old value it returns a boolean indicating whether the swap
succeeded. For our needs either will work, but compare-and-set is slightly
simpler for examples, so we use it and just refer to it as “CAS”.</p>

<p>The acquisition of the spin lock is written like this (using a C-like
language):</p>

<pre>do {
    success = atomic_cas(&lock, 0, 1)
} while (!success)

full_memory_barrier()

<em>critical-section</em></pre>

<p>If no thread holds the lock, the lock value will be 0, and the CAS operation
will set it to 1 to indicate that we now have it. If another thread has it, the
lock value will be 1, and the CAS operation will fail because the expected
current value does not match the actual current value. We loop and retry.
(Note this loop is on top of whatever loop the LL/SC code might be doing inside
the atomic_cas function.)</p>

<div style="padding:.5em 2em;">
<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
<p>On SMP, a spin lock is a useful way to guard a small critical section. If we
know that another thread is going to execute a handful of instructions and then
release the lock, we can just burn a few cycles while we wait our turn.
However, if the other thread happens to be executing on the same core, we’re
just wasting time because the other thread can’t make progress until the OS
schedules it again (either by migrating it to a different core or by preempting
us). A proper spin lock implementation would optimistically spin a few times
and then fall back on an OS primitive (such as a Linux futex) that allows the
current thread to sleep while waiting for the other thread to finish up. On a
uniprocessor you never want to spin at all. For the sake of brevity we’re
ignoring all this.</p>
</div>
</div>

<p>The memory barrier is necessary to ensure that other threads observe the
acquisition of the lock before they observe any loads or stores in the critical
section. Without that barrier, the memory accesses could be observed while the
lock is not held.</p>

<p>The <code>full_memory_barrier</code> call here actually does
<strong>two</strong> independent operations. First, it issues the CPU’s full
barrier instruction. Second, it tells the compiler that it is not allowed to
reorder code around the barrier. That way, we know that the
<code>atomic_cas</code> call will be executed before anything in the critical
section. Without this <em>compiler reorder barrier</em>, the compiler has a
great deal of freedom in how it generates code, and the order of instructions in
the compiled code might be much different from the order in the source code.</p>

<p>Of course, we also want to make sure that none of the memory accesses
performed in the critical section are observed after the lock is released. The
full version of the simple spin lock is:</p>

<pre>do {
    success = atomic_cas(&lock, 0, 1)   // acquire
} while (!success)
full_memory_barrier()

<em>critical-section</em>

full_memory_barrier()
atomic_store(&lock, 0)   // release</pre>

<p>We perform our second CPU/compiler memory barrier immediately
<strong>before</strong> we release the lock, so that loads and stores in the
critical section are observed before the release of the lock.</p>

<p>As mentioned earlier, the <code>atomic_store</code> operation is a simple
assignment on ARM and x86. Unlike the atomic RMW operations, we don’t guarantee
that other threads will see this value immediately. This isn’t a problem,
though, because we only need to keep the other threads <strong>out</strong>. The
other threads will stay out until they observe the store of 0. If it takes a
little while for them to observe it, the other threads will spin a little
longer, but we will still execute code correctly.</p>

<p>It’s convenient to combine the atomic operation and the barrier call into a
single function. It also provides other advantages, which will become clear
shortly.</p>
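
<p>Combined, the pieces above might look like this in C. This is a minimal
sketch, assuming a GCC-compatible toolchain, where
<code>__sync_bool_compare_and_swap</code> plays the role of the CAS and
<code>__sync_synchronize()</code> is the full CPU/compiler barrier:</p>

<pre>typedef volatile int spin_lock_t;

void spin_lock(spin_lock_t* lock)
{
    /* acquire: atomic op first, then the full barrier */
    while (!__sync_bool_compare_and_swap(lock, 0, 1)) {
        /* spin until the lock word changes from 0 to 1 under us */
    }
    __sync_synchronize();
}

void spin_unlock(spin_lock_t* lock)
{
    /* release: full barrier first, then a plain store
       (an aligned 32-bit store is itself atomic) */
    __sync_synchronize();
    *lock = 0;
}</pre>

<p>(On GCC the <code>__sync</code> builtins already imply a full barrier, so the
explicit <code>__sync_synchronize()</code> in <code>spin_lock</code> is
redundant there; it’s kept to mirror the pseudocode.)</p>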


<h4 id="acq_rel">Acquire and release</h4>

<p>When acquiring the spinlock, we issue the atomic CAS and then the barrier.
When releasing the spinlock, we issue the barrier and then the atomic store.
This inspires a particular naming convention: operations followed by a barrier
are “acquiring” operations, while operations preceded by a barrier are
“releasing” operations. (It would be wise to install the spin lock example
firmly in mind, as the names are not otherwise intuitive.)</p>

<p>Rewriting the spin lock example with this in mind:</p>

<pre>do {
    success = atomic_<strong>acquire</strong>_cas(&lock, 0, 1)
} while (!success)

<em>critical-section</em>

atomic_<strong>release</strong>_store(&lock, 0)</pre>

<p>This is a little more succinct and easier to read, but the real motivation
for doing this lies in a couple of optimizations we can now perform.</p>

<p>First, consider <code>atomic_release_store</code>. We need to ensure that
the store of zero to the lock word is observed after any loads or stores in the
critical section above it. In other words, we need a load/store and store/store
barrier. In an earlier section we learned that these aren’t necessary on x86
SMP &mdash; only store/load barriers are required. The implementation of
<code>atomic_release_store</code> on x86 is therefore just a compiler reorder
barrier followed by a simple store. No CPU barrier is required.</p>
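
<p>A hedged sketch of what that x86 implementation could look like, using a
GCC-style empty <code>asm</code> statement with a <code>"memory"</code> clobber
as the compiler reorder barrier (the macro name is invented for this
example):</p>

<pre>#define atomic_release_store_x86(addr, value)                               \
    do {                                                                    \
        __asm__ __volatile__("" ::: "memory");  /* compiler barrier only */ \
        *(addr) = (value);                      /* plain aligned store */   \
    } while (0)</pre>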

<p>The second optimization mostly applies to the compiler (although some CPUs,
such as the Itanium, can take advantage of it as well). The basic principle is
that code can move across acquire and release barriers, but only in one
direction.</p>

<p>Suppose we have a mix of locally-visible and globally-visible memory
accesses, with some miscellaneous computation as well:</p>

<pre>local1 = arg1 / 41
local2 = threadStruct->field2
threadStruct->field3 = local2

do {
    success = atomic_acquire_cas(&lock, 0, 1)
} while (!success)

local5 = globalStruct->field5
globalStruct->field6 = local5

atomic_release_store(&lock, 0)</pre>

<p>Here we see two completely independent sets of operations. The first set
operates on a thread-local data structure, so we’re not concerned about clashes
with other threads. The second set operates on a global data structure, which
must be protected with a lock.</p>

<p>A full compiler reorder barrier in the atomic ops will ensure that the
program order matches the source code order at the lock boundaries. However,
allowing the compiler to interleave instructions can improve performance. Loads
from memory can be slow, but the CPU can continue to execute instructions that
don’t require the result of that load while waiting for it to complete. The
code might execute more quickly if it were written like this instead:</p>

<pre>do {
    success = atomic_acquire_cas(&lock, 0, 1)
} while (!success)

local2 = threadStruct->field2
local5 = globalStruct->field5
local1 = arg1 / 41
threadStruct->field3 = local2
globalStruct->field6 = local5

atomic_release_store(&lock, 0)</pre>

<p>We issue both loads, do some unrelated computation, and then execute the
instructions that make use of the loads. If the integer division takes less
time than one of the loads, we essentially get it for free, since it happens
during a period where the CPU would have stalled waiting for a load to
complete.</p>

<p>Note that <strong>all</strong> of the operations are now happening inside the
critical section. Since none of the “threadStruct” operations are visible
outside the current thread, nothing else can see them until we’re finished here,
so it doesn’t matter exactly when they happen.</p>

<p>In general, it is always safe to move operations <strong>into</strong> a
critical section, but never safe to move operations <strong>out of</strong> a
critical section. Put another way, you can migrate code “downward” across an
acquire barrier, and “upward” across a release barrier. If the atomic ops used
a full barrier, this sort of migration would not be possible.</p>

<p>Returning to an earlier point, we can state that on x86 all loads are
acquiring loads, and all stores are releasing stores. As a result:</p>

<ul>
<li>Loads may not be reordered with respect to each other. You can’t take a
load and move it “upward” across another load’s acquire barrier.</li>
<li>Stores may not be reordered with respect to each other, because you can’t
move a store “downward” across another store’s release barrier.</li>
<li>A load followed by a store can’t be reordered, because neither instruction
will tolerate it.</li>
<li>A store followed by a load <strong>can</strong> be reordered, because each
instruction can move across the other in that direction.</li>
</ul>

<p>Hence, you only need store/load barriers on x86 SMP.</p>

<p>Labeling atomic operations with “acquire” or “release” describes not only
whether the barrier is executed before or after the atomic operation, but also
how the compiler is allowed to reorder code.</p>

1045<h2 id="practice">Practice</h2>
1046
1047<p>Debugging memory consistency problems can be very difficult. If a missing
1048memory barrier causes some code to read stale data, you may not be able to
1049figure out why by examining memory dumps with a debugger. By the time you can
1050issue a debugger query, the CPU cores will have all observed the full set of
1051accesses, and the contents of memory and the CPU registers will appear to be in
1052an “impossible” state.</p>
1053
1054<h3 id="c_dont">What not to do in C</h3>
1055
1056<p>Here we present some examples of incorrect code, along with simple ways to
1057fix them. Before we do that, we need to discuss the use of a basic language
1058feature.</p>
1059
David Friedman23bd5922013-09-26 22:30:11 -07001060<h4 id="volatile">C/C++ and "volatile"</h4>
Dirk Doughertyd5894212012-11-28 18:53:10 -08001061
1062<p>When writing single-threaded code, declaring a variable “volatile” can be
1063very useful. The compiler will not omit or reorder accesses to volatile
1064locations. Combine that with the sequential consistency provided by the
1065hardware, and you’re guaranteed that the loads and stores will appear to happen
1066in the expected order.</p>
1067
1068<p>However, accesses to volatile storage may be reordered with non-volatile
1069accesses, so you have to be careful in multi-threaded uniprocessor environments
1070(explicit compiler reorder barriers may be required). There are no atomicity
1071guarantees, and no memory barrier provisions, so “volatile” doesn’t help you at
1072all in multi-threaded SMP environments. The C and C++ language standards are
1073being updated to address this with built-in atomic operations.</p>
1074
1075<p>If you think you need to declare something “volatile”, that is a strong
1076indicator that you should be using one of the atomic operations instead.</p>
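
<p>For a taste of those built-in operations, here is a hedged sketch using the
C11 <code>&lt;stdatomic.h&gt;</code> interface (assuming a toolchain that ships
it); the acquire/release orderings correspond to the acquiring loads and
releasing stores described earlier:</p>

<pre>#include &lt;stdatomic.h&gt;

atomic_int ready;   /* file-scope atomics start out zero */

void publisher(void)          /* thread 1 */
{
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int consumer(void)            /* thread 2 */
{
    return atomic_load_explicit(&ready, memory_order_acquire);
}</pre>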

<h4 id="examplesc">Examples</h4>

<p>In most cases you’d be better off with a synchronization primitive (like a
pthread mutex) rather than an atomic operation, but we will employ the latter to
illustrate how they would be used in a practical situation.</p>

<p>For the sake of brevity we’re ignoring the effects of compiler optimizations
here &mdash; some of this code is broken even on uniprocessors &mdash; so for
all of these examples you must assume that the compiler generates
straightforward code (for example, compiled with gcc -O0). The fixes presented here do
solve both compiler-reordering and memory-access-ordering issues, but we’re only
going to discuss the latter.</p>

<pre>MyThing* gGlobalThing = NULL;

void initGlobalThing()    // runs in thread 1
{
    MyThing* thing = malloc(sizeof(*thing));
    memset(thing, 0, sizeof(*thing));
    thing->x = 5;
    thing->y = 10;
    /* initialization complete, publish */
    gGlobalThing = thing;
}

void useGlobalThing()    // runs in thread 2
{
    if (gGlobalThing != NULL) {
        int i = gGlobalThing->x;    // could be 5, 0, or uninitialized data
        ...
    }
}</pre>

<p>The idea here is that we allocate a structure, initialize its fields, and at
the very end we “publish” it by storing it in a global variable. At that point,
any other thread can see it, but that’s fine since it’s fully initialized,
right? At least, it would be on x86 SMP or a uniprocessor (again, making the
erroneous assumption that the compiler outputs code exactly as we have it in the
source).</p>

<p>Without a memory barrier, the store to <code>gGlobalThing</code> could be observed before
the fields are initialized on ARM. Another thread reading from <code>thing->x</code> could
see 5, 0, or even uninitialized data.</p>

<p>This can be fixed by changing the last assignment to:</p>

<pre>    atomic_release_store(&gGlobalThing, thing);</pre>

<p>That ensures that all other threads will observe the writes in the proper
order, but what about reads? In this case we should be okay on ARM, because the
address dependency rules will ensure that any loads from an offset of
<code>gGlobalThing</code> are observed after the load of
<code>gGlobalThing</code>. However, it’s unwise to rely on architectural
details, since it means your code will be very subtly unportable. The complete
fix also requires a barrier after the load:</p>

<pre>    MyThing* thing = atomic_acquire_load(&gGlobalThing);
    int i = thing->x;</pre>

<p>Now we know the ordering will be correct. This may seem like an awkward way
to write code, and it is, but that’s the price you pay for accessing data
structures from multiple threads without using locks. Besides, address
dependencies won’t always save us:</p>

<pre>MyThing gGlobalThing;

void initGlobalThing()    // runs in thread 1
{
    gGlobalThing.x = 5;
    gGlobalThing.y = 10;
    /* initialization complete */
    gGlobalThing.initialized = true;
}

void useGlobalThing()    // runs in thread 2
{
    if (gGlobalThing.initialized) {
        int i = gGlobalThing.x;    // could be 5 or 0
    }
}</pre>

<p>Because there is no relationship between the <code>initialized</code> field and the
others, the reads and writes can be observed out of order. (Note global data is
initialized to zero by the OS, so it shouldn’t be possible to read “random”
uninitialized data.)</p>

<p>We need to replace the store with:</p>
<pre>    atomic_release_store(&gGlobalThing.initialized, true);</pre>

<p>and replace the load with:</p>
<pre>    int initialized = atomic_acquire_load(&gGlobalThing.initialized);</pre>

<p>Another example of the same problem occurs when implementing
reference-counted data structures. The reference count itself will be
consistent so long as atomic increment and decrement operations are used, but
you can still run into trouble at the edges, for example:</p>

<pre>void RefCounted::release()
{
    int oldCount = atomic_dec(&mRefCount);
    if (oldCount == 1) {    // was decremented to zero
        recycleStorage();
    }
}

void useSharedThing(RefCountedThing* sharedThing)
{
    int localVar = sharedThing->x;
    sharedThing->release();
    sharedThing = NULL;    // can’t use this pointer any more
    doStuff(localVar);     // value of localVar might be wrong
}</pre>

<p>The <code>release()</code> call decrements the reference count using a
barrier-free atomic decrement operation. Because this is an atomic RMW
operation, we know that it will work correctly. If the reference count goes to
zero, we recycle the storage.</p>

<p>The <code>useSharedThing()</code> function extracts what it needs from
<code>sharedThing</code> and then releases its copy. However, because we didn’t
use a memory barrier, and atomic and non-atomic operations can be reordered,
it’s possible for other threads to observe the read of
<code>sharedThing->x</code> <strong>after</strong> they observe the recycle
operation. It’s therefore possible for <code>localVar</code> to hold a value
from "recycled" memory, for example a new object created in the same
location by another thread after <code>release()</code> is called.</p>

<p>This can be fixed by replacing the call to <code>atomic_dec()</code> with
<code>atomic_release_dec()</code>. The barrier ensures that the reads from
<code>sharedThing</code> are observed before we recycle the object.</p>

<div style="padding:.5em 2em;">
<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
<p>In most cases the above won’t actually fail, because the “recycle” function
is likely guarded by functions that themselves employ barriers (libc heap
<code>free()</code>/<code>delete()</code>, or an object pool guarded by a
mutex). If the recycle function used a lock-free algorithm implemented without
barriers, however, the above code could fail on ARM SMP.</p>
</div>
</div>
1218
1219<h3 id="j_dont">What not to do in Java</h3>
1220
1221<p>We haven’t discussed some relevant Java language features, so we’ll take a
1222quick look at those first.</p>
1223
1224<h4 id="sync_volatile">Java's "synchronized" and "volatile" keywords</h4>
1225
1226<p>The “synchronized” keyword provides the Java language’s in-built locking
1227mechanism. Every object has an associated “monitor” that can be used to provide
1228mutually exclusive access.</p>
1229
1230<p>The implementation of the “synchronized” block has the same basic structure
1231as the spin lock example: it begins with an acquiring CAS, and ends with a
1232releasing store. This means that compilers and code optimizers are free to
1233migrate code into a “synchronized” block. One practical consequence: you must
1234<strong>not</strong> conclude that code inside a synchronized block happens
1235after the stuff above it or before the stuff below it in a function. Going
1236further, if a method has two synchronized blocks that lock the same object, and
1237there are no operations in the intervening code that are observable by another
1238thread, the compiler may perform “lock coarsening” and combine them into a
1239single block.</p>
1240
1241<p>The other relevant keyword is “volatile”. As defined in the specification
1242for Java 1.4 and earlier, a volatile declaration was about as weak as its C
1243counterpart. The spec for Java 1.5 was updated to provide stronger guarantees,
1244almost to the level of monitor synchronization.</p>
1245
1246<p>The effects of volatile accesses can be illustrated with an example. If
1247thread 1 writes to a volatile field, and thread 2 subsequently reads from that
1248same field, then thread 2 is guaranteed to see that write and all writes
1249previously made by thread 1. More generally, the writes made by
1250<strong>any</strong> thread up to the point where it writes the field will be
1251visible to thead 2 when it does the read. In effect, writing to a volatile is
1252like a monitor release, and reading from a volatile is like a monitor
1253acquire.</p>
1254
1255<p>Non-volatile accesses may be reorded with respect to volatile accesses in the
1256usual ways, for example the compiler could move a non-volatile load or store “above” a
1257volatile store, but couldn’t move it “below”. Volatile accesses may not be
1258reordered with respect to each other. The VM takes care of issuing the
1259appropriate memory barriers.</p>
1260
1261<p>It should be mentioned that, while loads and stores of object references and
1262most primitive types are atomic, <code>long</code> and <code>double</code>
1263fields are not accessed atomically unless they are marked as volatile.
1264Multi-threaded updates to non-volatile 64-bit fields are problematic even on
1265uniprocessors.</p>
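
<p>For example, marking a 64-bit field volatile guarantees that its individual
loads and stores won’t be torn into two 32-bit halves (though compound updates
like <code>+=</code> still need a lock or an atomic class):</p>

<pre>class ByteCounter {
    // Without "volatile", a reader on another thread could observe
    // half of an in-progress 64-bit store.
    volatile long mTotalBytes;
}</pre>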
1266
1267<h4 id="examplesj">Examples</h4>
1268
1269<p>Here’s a simple, incorrect implementation of a monotonic counter: <span
1270style="font-size:.9em;color:#777">(<em><a href="#more" style="color:#777">Java
1271theory and practice: Managing volatility</a></em>)</span>.</p>
1272
1273<pre>class Counter {
1274 private int mValue;
1275
1276 public int get() {
1277 return mValue;
1278 }
1279 public void incr() {
1280 mValue++;
1281 }
1282}</pre>
1283
1284<p>Assume <code>get()</code> and <code>incr()</code> are called from multiple
1285threads, and we want to be sure that every thread sees the current count when
1286<code>get()</code> is called. The most glaring problem is that
1287<code>mValue++</code> is actually three operations:</p>
1288
1289<ol>
1290<li><code>reg = mValue</code></li>
1291<li><code>reg = reg + 1</code></li>
1292<li><code>mValue = reg</code></li>
1293</ol>
1294
1295<p>If two threads execute in <code>incr()</code> simultaneously, one of the
1296updates could be lost. To make the increment atomic, we need to declare
1297<code>incr()</code> “synchronized”. With this change, the code will run
1298correctly in multi-threaded uniprocessor environments.</p>
1299
1300<p>It’s still broken on SMP, however. Different threads might see different
1301results from <code>get()</code>, because we’re reading the value with an ordinary load. We
1302can correct the problem by declaring <code>get()</code> to be synchronized.
1303With this change, the code is obviously correct.</p>
1304
1305<p>Unfortunately, we’ve introduced the possibility of lock contention, which
1306could hamper performance. Instead of declaring <code>get()</code> to be
synchronized, we could declare <code>mValue</code> with “volatile”. (Note
<code>incr()</code> must still use <code>synchronized</code>.) Now we know that
1309the volatile write to <code>mValue</code> will be visible to any subsequent volatile read of
1310<code>mValue</code>. <code>incr()</code> will be slightly slower, but
1311<code>get()</code> will be faster, so even in the absence of contention this is
1312a win if reads outnumber writes. (See also {@link
1313java.util.concurrent.atomic.AtomicInteger}.)</p>
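
<p>Putting the two changes together, the counter might look like this (a sketch
of the pattern just described):</p>

<pre>class Counter {
    private volatile int mValue;

    public int get() {
        return mValue;      // volatile read: sees the most recent increment
    }
    public synchronized void incr() {
        mValue++;           // read-modify-write, guarded by the monitor
    }
}</pre>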
1314
1315<p>Here’s another example, similar in form to the earlier C examples:</p>
1316
1317<pre>class MyGoodies {
1318 public int x, y;
1319}
1320class MyClass {
1321 static MyGoodies sGoodies;
1322
1323 void initGoodies() { // runs in thread 1
1324 MyGoodies goods = new MyGoodies();
1325 goods.x = 5;
1326 goods.y = 10;
1327 sGoodies = goods;
1328 }
1329
1330 void useGoodies() { // runs in thread 2
1331 if (sGoodies != null) {
1332 int i = sGoodies.x; // could be 5 or 0
1333 ....
1334 }
1335 }
1336}</pre>
1337
1338<p>This has the same problem as the C code, namely that the assignment
1339<code>sGoodies = goods</code> might be observed before the initialization of the
1340fields in <code>goods</code>. If you declare <code>sGoodies</code> with the
1341volatile keyword, you can think about the loads as if they were
1342<code>atomic_acquire_load()</code> calls, and the stores as if they were
1343<code>atomic_release_store()</code> calls.</p>
1344
1345<p>(Note that only the <code>sGoodies</code> reference itself is volatile. The
1346accesses to the fields inside it are not. The statement <code>z =
1347sGoodies.x</code> will perform a volatile load of <code>MyClass.sGoodies</code>
1348followed by a non-volatile load of <code>sGoodies.x</code>. If you make a local
1349reference <code>MyGoodies localGoods = sGoodies</code>, <code>z =
1350localGoods.x</code> will not perform any volatile loads.)</p>
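
<p>Concretely, the fixed version would read as follows; the comments show the
correspondence with the atomic calls mentioned above:</p>

<pre>class MyClass {
    static volatile MyGoodies sGoodies;

    void initGoodies() {    // runs in thread 1
        MyGoodies goods = new MyGoodies();
        goods.x = 5;
        goods.y = 10;
        sGoodies = goods;   // volatile store, like atomic_release_store()
    }

    void useGoodies() {     // runs in thread 2
        MyGoodies localGoods = sGoodies;    // volatile load, like atomic_acquire_load()
        if (localGoods != null) {
            int i = localGoods.x;           // now guaranteed to be 5
        }
    }
}</pre>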
1351
1352<p>A more common idiom in Java programming is the infamous “double-checked
1353locking”:</p>
1354
1355<pre>class MyClass {
1356 private Helper helper = null;
1357
1358 public Helper getHelper() {
1359 if (helper == null) {
1360 synchronized (this) {
1361 if (helper == null) {
1362 helper = new Helper();
1363 }
1364 }
1365 }
1366 return helper;
1367 }
1368}</pre>
1369
1370<p>The idea is that we want to have a single instance of a <code>Helper</code>
1371object associated with an instance of <code>MyClass</code>. We must only create
1372it once, so we create and return it through a dedicated <code>getHelper()</code>
1373function. To avoid a race in which two threads create the instance, we need to
1374synchronize the object creation. However, we don’t want to pay the overhead for
1375the “synchronized” block on every call, so we only do that part if
1376<code>helper</code> is currently null.</p>
1377
1378<p>This doesn’t work correctly on uniprocessor systems, unless you’re using a
1379traditional Java source compiler and an interpreter-only VM. Once you add fancy
1380code optimizers and JIT compilers it breaks down. See the “‘Double Checked
1381Locking is Broken’ Declaration” link in the appendix for more details, or Item
71 (“Use lazy initialization judiciously”) in Josh Bloch’s <em>Effective Java,
2nd Edition</em>.</p>
1384
1385<p>Running this on an SMP system introduces an additional way to fail. Consider
1386the same code rewritten slightly, as if it were compiled into a C-like language
(I’ve added a couple of integer fields to represent <code>Helper</code>’s
constructor activity):</p>
1389
1390<pre>if (helper == null) {
1391 // acquire monitor using spinlock
1392 while (atomic_acquire_cas(&this.lock, 0, 1) != success)
1393 ;
1394 if (helper == null) {
1395 newHelper = malloc(sizeof(Helper));
1396 newHelper->x = 5;
1397 newHelper->y = 10;
1398 helper = newHelper;
1399 }
1400 atomic_release_store(&this.lock, 0);
1401}</pre>
1402
1403<p>Now the problem should be obvious: the store to <code>helper</code> is
1404happening before the memory barrier, which means another thread could observe
1405the non-null value of <code>helper</code> before the stores to the
1406<code>x</code>/<code>y</code> fields.</p>
1407
1408<p>You could try to ensure that the store to <code>helper</code> happens after
1409the <code>atomic_release_store()</code> on <code>this.lock</code> by rearranging
1410the code, but that won’t help, because it’s okay to migrate code upward &mdash;
1411the compiler could move the assignment back above the
1412<code>atomic_release_store()</code> to its original position.</p>
1413
1414<p>There are two ways to fix this:</p>
1415<ol>
1416<li>Do the simple thing and delete the outer check. This ensures that we never
1417examine the value of <code>helper</code> outside a synchronized block.</li>
<li>Declare <code>helper</code> volatile. With this one small change, the
double-checked locking code above will work correctly on Java 1.5 and later,
as sketched below. (You may want to take a minute to convince yourself that
this is true.)</li>
1421</ol>
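
<p>With the second fix applied, the example becomes:</p>

<pre>class MyClass {
    private volatile Helper helper = null;

    public Helper getHelper() {
        if (helper == null) {               // volatile read
            synchronized (this) {
                if (helper == null) {
                    helper = new Helper();  // volatile write publishes the
                }                           // fully-constructed object
            }
        }
        return helper;
    }
}</pre>

<p>(A common refinement copies <code>helper</code> into a local variable so the
hot path performs only one volatile read; see <em>Effective Java</em> for
details.)</p>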
1422
1423<p>This next example illustrates two important issues when using volatile:</p>
1424
1425<pre>class MyClass {
1426 int data1, data2;
1427 volatile int vol1, vol2;
1428
1429 void setValues() { // runs in thread 1
1430 data1 = 1;
1431 vol1 = 2;
1432 data2 = 3;
1433 }
1434
1435 void useValues1() { // runs in thread 2
1436 if (vol1 == 2) {
1437 int l1 = data1; // okay
1438 int l2 = data2; // wrong
1439 }
1440 }
1441 void useValues2() { // runs in thread 2
1442 int dummy = vol2;
1443 int l1 = data1; // wrong
1444 int l2 = data2; // wrong
    }
}</pre>
1446
1447<p>Looking at <code>useValues1()</code>, if thread 2 hasn’t yet observed the
1448update to <code>vol1</code>, then it can’t know if <code>data1</code> or
1449<code>data2</code> has been set yet. Once it sees the update to
1450<code>vol1</code>, it knows that the change to <code>data1</code> is also
1451visible, because that was made before <code>vol1</code> was changed. However,
1452it can’t make any assumptions about <code>data2</code>, because that store was
1453performed after the volatile store.</p>
1454
<p>The code in <code>useValues2()</code> uses a second volatile field,
1456<code>vol2</code>, in an attempt to force the VM to generate a memory barrier.
1457This doesn’t generally work. To establish a proper “happens-before”
1458relationship, both threads need to be interacting with the same volatile field.
1459You’d have to know that <code>vol2</code> was set after <code>data1/data2</code>
1460in thread 1. (The fact that this doesn’t work is probably obvious from looking
1461at the code; the caution here is against trying to cleverly “cause” a memory
1462barrier instead of creating an ordered series of accesses.)</p>
1463
1464<h3 id="bestpractice">What to do</h3>
1465
1466<h4 id="advice">General advice</h4>
1467
1468<p>In C/C++, use the <code>pthread</code> operations, like mutexes and
1469semaphores. These include the proper memory barriers, providing correct and
1470efficient behavior on all Android platform versions. Be sure to use them
correctly; for example, be wary of signaling a condition variable without
holding the corresponding mutex.</p>
1473
1474<p>It's best to avoid using atomic functions directly. Locking and
1475unlocking a pthread mutex require a single atomic operation each if there’s no
1476contention, so you’re not going to save much by replacing mutex calls with
1477atomic ops. If you need a lock-free design, you must fully understand the
1478concepts in this entire document before you begin (or, better yet, find an
1479existing code library that is known to be correct on SMP ARM).</p>
1480
<p>Be extremely circumspect with “volatile” in C/C++. It often indicates a
1482concurrency problem waiting to happen.</p>
1483
1484<p>In Java, the best answer is usually to use an appropriate utility class from
1485the {@link java.util.concurrent} package. The code is well written and well
1486tested on SMP.</p>
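
<p>For example, the counter from earlier could simply delegate to {@link
java.util.concurrent.atomic.AtomicInteger}, replacing both the lock and the
volatile field:</p>

<pre>import java.util.concurrent.atomic.AtomicInteger;

class Counter {
    private final AtomicInteger mValue = new AtomicInteger();

    public int get() {
        return mValue.get();            // acts like a volatile read
    }
    public void incr() {
        mValue.incrementAndGet();       // lock-free atomic increment
    }
}</pre>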
1487
<p>Perhaps the safest thing you can do is make your class immutable. Objects
of classes like String and Integer hold data that cannot be changed once the
object is created, avoiding all synchronization issues. The book <em>Effective
Java, 2nd Ed.</em> has specific instructions in “Item 15: Minimize Mutability”. Note in
particular the importance of declaring fields “final” <span
style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Bloch</a>)</span>.</p>
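
<p>A minimal sketch of the pattern: all fields private and final, no mutators,
and the class itself final so a subclass can’t add mutable state:</p>

<pre>public final class Point {
    private final int mX, mY;

    public Point(int x, int y) {
        mX = x;
        mY = y;
    }
    public int getX() { return mX; }
    public int getY() { return mY; }
}</pre>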
1494
1495<p>If neither of these options is viable, the Java “synchronized” statement
1496should be used to guard any field that can be accessed by more than one thread.
1497If mutexes won’t work for your situation, you should declare shared fields
1498“volatile”, but you must take great care to understand the interactions between
1499threads. The volatile declaration won’t save you from common concurrent
1500programming mistakes, but it will help you avoid the mysterious failures
1501associated with optimizing compilers and SMP mishaps.</p>
1502
1503<p>The Java Memory Model guarantees that assignments to final fields are visible
1504to all threads once the constructor has finished &mdash; this is what ensures
1505proper synchronization of fields in immutable classes. This guarantee does not
1506hold if a partially-constructed object is allowed to become visible to other
threads. It is necessary to follow safe construction practices <span
style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Safe
Construction Techniques in Java</a>)</span>.</p>
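
<p>For example, a constructor that lets <code>this</code> escape before it
finishes defeats the final-field guarantee. (This is a sketch;
<code>EventSource</code> and its <code>register()</code> method are
hypothetical.)</p>

<pre>class Unsafe {
    private final int mValue;

    Unsafe(EventSource source) {
        source.register(this);  // leak: another thread may see this object...
        mValue = 42;            // ...before this assignment is visible
    }
}</pre>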
1510
1511<h4 id="sync_guarantees">Synchronization primitive guarantees</h4>
1512
1513<p>The pthread library and VM make a couple of useful guarantees: all accesses
1514previously performed by a thread that creates a new thread are observable by
1515that new thread as soon as it starts, and all accesses performed by a thread
1516that is exiting are observable when a <code>join()</code> on that thread
1517returns. This means you don’t need any additional synchronization when
1518preparing data for a new thread or examining the results of a joined thread.</p>
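
<p>For example, the following needs no locks or volatile fields. (This is a
sketch; <code>computeValue()</code> is a stand-in for real work, and the
<code>InterruptedException</code> thrown by <code>join()</code> is not
handled.)</p>

<pre>final int[] result = new int[1];
Thread t = new Thread(new Runnable() {
    public void run() {
        // Anything written before start() was called is visible here.
        result[0] = computeValue();
    }
});
t.start();
t.join();           // when this returns, the write to result[0] is visible
int v = result[0];  // safe without further synchronization</pre>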
1519
1520<p>Whether or not these guarantees apply to interactions with pooled threads
1521depends on the thread pool implementation.</p>
1522
1523<p>In C/C++, the pthread library guarantees that any accesses made by a thread
1524before it unlocks a mutex will be observable by another thread after it locks
1525that same mutex. It also guarantees that any accesses made before calling
1526<code>signal()</code> or <code>broadcast()</code> on a condition variable will
1527be observable by the woken thread.</p>
1528
1529<p>Java language threads and monitors make similar guarantees for the comparable
1530operations.</p>
1531
1532<h4 id="ccpp_changes">Upcoming changes to C/C++</h4>
1533
1534<p>The C and C++ language standards are evolving to include a sophisticated
1535collection of atomic operations. A full matrix of calls for common data types
1536is defined, with selectable memory barrier semantics (choose from relaxed,
1537consume, acquire, release, acq_rel, seq_cst).</p>
1538
1539<p>See the <a href="#more">Further Reading</a> section for pointers to the
1540specifications.</p>
1541
1542
1543<h2 id="closing_notes">Closing Notes</h2>
1544
1545<p>While this document does more than merely scratch the surface, it doesn’t
1546manage more than a shallow gouge. This is a very broad and deep topic. Some
1547areas for further exploration:</p>
1548
1549<ul>
1550<li>Learn the definitions of <em>happens-before</em>,
1551<em>synchronizes-with</em>, and other essential concepts from the Java Memory
1552Model. (It’s hard to understand what “volatile” really means without getting
1553into this.)</li>
1554<li>Explore what compilers are and aren’t allowed to do when reordering code.
1555(The JSR-133 spec has some great examples of legal transformations that lead to
1556unexpected results.)</li>
1557<li>Find out how to write immutable classes in Java and C++. (There’s more to
1558it than just “don’t change anything after construction”.)</li>
1559<li>Internalize the recommendations in the Concurrency section of <em>Effective
1560Java, 2nd Edition.</em> (For example, you should avoid calling methods that are
1561meant to be overridden while inside a synchronized block.)</li>
1562<li>Understand what sorts of barriers you can use on x86 and ARM. (And other
1563CPUs for that matter, for example Itanium’s acquire/release instruction
1564modifiers.)</li>
1565<li>Read through the {@link java.util.concurrent} and {@link
1566java.util.concurrent.atomic} APIs to see what's available. Consider using
1567concurrency annotations like <code>@ThreadSafe</code> and
<code>@GuardedBy</code> (from <code>net.jcip.annotations</code>).</li>
1569</ul>
1570
1571<p>The <a href="#more">Further Reading</a> section in the appendix has links to
1572documents and web sites that will better illuminate these topics.</p>
1573
1574<h2 id="appendix">Appendix</h2>
1575
1576<h3 id="smp_failure_example">SMP failure example</h3>
1577
1578<p>This document describes a lot of “weird” things that can, in theory, happen.
1579If you’re not convinced that these issues are real, a practical example may be
1580useful.</p>
1581
1582<p>Bill Pugh’s Java memory model <a
1583href="http://www.cs.umd.edu/~pugh/java/memoryModel/">web site</a> has a few
1584test programs on it. One interesting test is ReadAfterWrite.java, which does
1585the following:</p>
1586
1587<table>
1588<tr>
1589<th>Thread 1</th>
1590<th>Thread 2</th>
1591</tr>
1592<tr>
1593<td><code>for (int i = 0; i < ITERATIONS; i++) {<br />
1594&nbsp;&nbsp;&nbsp;&nbsp;a = i;<br />
1595&nbsp;&nbsp;&nbsp;&nbsp;BB[i] = b;<br />
1596}</code></td>
1597<td><code>for (int i = 0; i < ITERATIONS; i++) {<br />
1598&nbsp;&nbsp;&nbsp;&nbsp;b = i;<br />
1599&nbsp;&nbsp;&nbsp;&nbsp;AA[i] = a;<br />
1600}</code></td>
1601</tr>
1602</table>
1603
1604<p>Where <code>a</code> and <code>b</code> are declared as volatile
1605<code>int</code> fields, and <code>AA</code> and <code>BB</code> are ordinary
1606integer arrays.
1607
1608<p>This is trying to determine if the VM ensures that, after a value is written
1609to a volatile, the next read from that volatile sees the new value. The test
1610code executes these loops a million or so times, and then runs through afterward
1611and searches the results for inconsistencies.</p>
1612
<p>At the end of execution, <code>AA</code> and <code>BB</code> will be full of
1614gradually-increasing integers. The threads will not run side-by-side in a
1615predictable way, but we can assert a relationship between the array contents.
1616For example, consider this execution fragment:</p>
1617
1618<table>
1619<tr>
1620<th>Thread 1</th>
1621<th>Thread 2</th>
1622</tr>
1623<tr>
1624<td><code>(initially a == 1534)<br />
1625a = 1535<br />
1626BB[1535] = 165<br />
1627a = 1536<br />
1628BB[1536] = 165<br />
1629<br />
1630<br />
1631<br />
1632<br />
1633a = 1537<br />
1634BB[1537] = 167</code></td>
1635<td><code>(initially b == 165)
1636<br />
1637<br />
1638<br />
1639<br />
1640<br />
1641b = 166<br />
1642AA[166] = 1536<br />
1643b = 167<br />
1644AA[167] = 1536<br />
1645<br /></code></td>
1646</tr>
1647</table>
1648
1649<p>(This is written as if the threads were taking turns executing so that it’s
1650more obvious when results from one thread should be visible to the other, but in
1651practice that won’t be the case.)</p>
1652
1653<p>Look at the assignment of <code>AA[166]</code> in thread 2. We are capturing
1654the fact that, at the point where thread 2 was on iteration 166, it can see that
1655thread 1 was on iteration 1536. If we look one step in the future, at thread
16561’s iteration 1537, we expect to see that thread 1 saw that thread 2 was at
1657iteration 166 (or later). <code>BB[1537]</code> holds 167, so it appears things
1658are working.</p>
1659
1660<p>Now suppose we fail to observe a volatile write to <code>b</code>:</p>
1661
1662<table>
1663<tr>
1664<th>Thread 1</th>
1665<th>Thread 2</th>
1666</tr>
1667<tr>
1668<td><code>(initially a == 1534)<br />
1669a = 1535<br />
1670BB[1535] = 165<br />
1671a = 1536<br />
1672BB[1536] = 165<br />
1673<br />
1674<br />
1675<br />
1676<br />
1677a = 1537<br />
1678BB[1537] = 165 // stale b</code></td>
1679<td><code>(initially b == 165)<br />
1680<br />
1681<br />
1682<br />
1683<br />
1684b = 166<br />
1685AA[166] = 1536<br />
1686b = 167<br />
1687AA[167] = 1536</code></td>
1688</tr>
1689</table>
1690
1691<p>Now, <code>BB[1537]</code> holds 165, a smaller value than we expected, so we
know we have a problem. Put succinctly, for i=166, <code>BB[AA[i]+1] &lt; i</code>.
(This also catches failures by thread 2 to observe writes to <code>a</code>;
for example, if we miss an update and assign <code>AA[166] = 1535</code>, we
will get <code>BB[AA[166]+1] == 165</code>.)</p>
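
<p>In code, the scan might look something like this sketch (the actual
ReadAfterWrite.java differs in its details):</p>

<pre>for (int i = 0; i < ITERATIONS; i++) {
    int j = AA[i] + 1;      // thread 1's next iteration
    if (j < ITERATIONS && BB[j] < i) {
        System.out.println("stale read at i=" + i);   // ordering failure
    }
}</pre>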
1696
1697<p>If you run the test program under Dalvik (Android 3.0 “Honeycomb” or later)
1698on an SMP ARM device, it will never fail. If you remove the word “volatile”
1699from the declarations of <code>a</code> and <code>b</code>, it will consistently
1700fail. The program is testing to see if the VM is providing sequentially
1701consistent ordering for accesses to <code>a</code> and <code>b</code>, so you
1702will only see correct behavior when the variables are volatile. (It will also
1703succeed if you run the code on a uniprocessor device, or run it while something
1704else is using enough CPU that the kernel doesn’t schedule the test threads on
1705separate cores.)</p>
1706
1707<p>If you run the modified test a few times you will note that it doesn’t fail
1708in the same place every time. The test fails consistently because it performs
1709the operations a million times, and it only needs to see out-of-order accesses
1710once. In practice, failures will be infrequent and difficult to locate. This
1711test program could very well succeed on a broken VM if things just happen to
1712work out.</p>
1713
1714<h3 id="sync_stores">Implementing synchronization stores</h3>
1715
1716<p><em>(This isn’t something most programmers will find themselves implementing,
1717but the discussion is illuminating.)</em></p>
1718
1719<p>Consider once again volatile accesses in Java. Earlier we made reference to
1720their similarities with acquiring loads and releasing stores, which works as a
1721starting point but doesn’t tell the full story.</p>
1722
1723<p>We start with a fragment of Dekker’s algorithm. Initially both
1724<code>flag1</code> and <code>flag2</code> are false:</p>
1725
1726<table>
1727<tr>
1728<th>Thread 1</th>
1729<th>Thread 2</th>
1730</tr>
1731<tr>
1732<td><code>flag1 = true<br />
1733if (flag2 == false)<br />
1734&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
1735<td><code>flag2 = true<br />
1736if (flag1 == false)<br />
1737&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
1738</tr>
</table>
1740
1741<p><code>flag1</code> and <code>flag2</code> are declared as volatile boolean
1742fields. The rules for acquiring loads and releasing stores would allow the
1743accesses in each thread to be reordered, breaking the algorithm. Fortunately,
1744the JMM has a few things to say here. Informally:</p>
1745
1746<ul>
1747<li>A write to a volatile field <em>happens-before</em> every subsequent read of that
1748same field. (For this example, it means that if one thread updates a flag, and
1749later on the other thread reads that flag, the reader is guaranteed to see the
1750write.)</li>
1751<li>Every execution has a total order over all volatile field accesses. The
1752order is consistent with program order.</li>
1753</ul>
1754
1755<p>Taken together, these rules say that the volatile accesses in our example
1756must be observable in program order by all threads. Thus, we will never see
1757these threads executing the “critical-stuff” simultaneously.</p>
1758
1759<div style="padding:.5em 2em;">
1760<div style="border-left:4px solid #999;padding:0 1em;font-style:italic;">
1761<p>Another way to think about this is in terms of <em>data races</em>. A data race
1762occurs if two accesses to the same memory location by different threads are not
1763ordered, at least one of them stores to the memory location, and at least one of
1764them is not a synchronization action <span style="font-size:.9em;color:#777">(<a
1765href="#more" style="color:#777">Boehm and McKenney</a>)</span>. The memory model
1766declares that a program free of data races must behave as if executed by a
1767sequentially-consistent machine. Because both <code>flag1</code> and
1768<code>flag2</code> are volatile, and volatile accesses are considered
1769synchronization actions, there are no data races and this code must execute in a
1770sequentially consistent manner.</p>
1771</div>
1772</div>
1773
1774<p>As we saw in an earlier section, we need to insert a store/load barrier
1775between the two operations. The code executed in the VM for a volatile access
1776will look something like this:</p>
1777
1778<table>
1779<tr>
1780<th>volatile load</th>
1781<th>volatile store</th>
1782</tr>
1783<tr>
1784<td><code>reg = A<br />
1785<em>load/load + load/store barrier</em></code></td>
1786<td><code><em>store/store barrier</em><br />
1787A = reg<br />
1788<em>store/load barrier</em></code></td>
1789</tr>
1790</table>
1791
1792<p>The volatile load is just an acquiring load. The volatile store is similar
1793to a releasing store, but we’ve omitted load/store from the pre-store barrier,
1794and added a store/load barrier afterward.</p>
1795
1796<p>What we’re really trying to guarantee, though, is that (using thread 1 as an
1797example) the write to flag1 is observed before the read of flag2. We could
1798issue the store/load barrier before the volatile load instead and get the same
1799result, but because loads tend to outnumber stores it’s best to associate it
1800with the store.</p>
1801
1802<p>On some architectures, it’s possible to implement volatile stores with an
1803atomic operation and skip the explicit store/load barrier. On x86, for example,
1804atomics provide a full barrier. The ARM LL/SC operations don’t include a
1805barrier, so for ARM we must use explicit barriers.</p>
1806
1807<p>(Much of this is due to Doug Lea and his “JSR-133 Cookbook for Compiler
1808Writers” page.)</p>
1809
1810<h3 id="more">Further reading</h3>
1811
1812<p>Web pages and documents that provide greater depth or breadth. The more generally useful articles are nearer the top of the list.</p>
1813
1814<dl>
1815<dt>Shared Memory Consistency Models: A Tutorial</dt>
1816<dd>Written in 1995 by Adve & Gharachorloo, this is a good place to start if you want to dive more deeply into memory consistency models.
1817<br /><a href="http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf">http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf</a></dd>
1818
1819<dt>Memory Barriers</dt>
1820<dd>Nice little article summarizing the issues.
1821<br /><a href="http://en.wikipedia.org/wiki/Memory_barrier">http://en.wikipedia.org/wiki/Memory_barrier</a></dd>
1822
1823<dt>Threads Basics</dt>
1824<dd>An introduction to multi-threaded programming in C++ and Java, by Hans Boehm. Excellent discussion of data races and basic synchronization methods.
1825<br /><a href="http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/threadsintro.html">http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/threadsintro.html</a></dd>
1826
1827<dt>Java Concurrency In Practice</dt>
1828<dd>Published in 2006, this book covers a wide range of topics in great detail. Highly recommended for anyone writing multi-threaded code in Java.
1829<br /><a href="http://www.javaconcurrencyinpractice.com">http://www.javaconcurrencyinpractice.com</a></dd>
1830
1831<dt>JSR-133 (Java Memory Model) FAQ</dt>
1832<dd>A gentle introduction to the Java memory model, including an explanation of synchronization, volatile variables, and construction of final fields.
1833<br /><a href="http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html">http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html</a></dd>
1834
1835<dt>Overview of package java.util.concurrent</dt>
1836<dd>The documentation for the <code>java.util.concurrent</code> package. Near the bottom of the page is a section entitled “Memory Consistency Properties” that explains the guarantees made by the various classes.
1837<br />{@link java.util.concurrent java.util.concurrent} Package Summary</dd>
1838
1839<dt>Java Theory and Practice: Safe Construction Techniques in Java</dt>
1840<dd>This article examines in detail the perils of references escaping during object construction, and provides guidelines for thread-safe constructors.
1841<br /><a href="http://www.ibm.com/developerworks/java/library/j-jtp0618.html">http://www.ibm.com/developerworks/java/library/j-jtp0618.html</a></dd>
1842
1843<dt>Java Theory and Practice: Managing Volatility</dt>
1844<dd>A nice article describing what you can and can’t accomplish with volatile fields in Java.
1845<br /><a href="http://www.ibm.com/developerworks/java/library/j-jtp06197.html">http://www.ibm.com/developerworks/java/library/j-jtp06197.html</a></dd>
1846
1847<dt>The “Double-Checked Locking is Broken” Declaration</dt>
1848<dd>Bill Pugh’s detailed explanation of the various ways in which double-checked locking is broken. Includes C/C++ and Java.
1849<br /><a href="http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html">http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html</a></dd>
1850
1851<dt>[ARM] Barrier Litmus Tests and Cookbook</dt>
1852<dd>A discussion of ARM SMP issues, illuminated with short snippets of ARM code. If you found the examples in this document too un-specific, or want to read the formal description of the DMB instruction, read this. Also describes the instructions used for memory barriers on executable code (possibly useful if you’re generating code on the fly).
1853<br /><a href="http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf">http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf</a></dd>
1854
<dt>Linux Kernel Memory Barriers</dt>
1856<dd>Documentation for Linux kernel memory barriers. Includes some useful examples and ASCII art.
1857<br/><a href="http://www.kernel.org/doc/Documentation/memory-barriers.txt">http://www.kernel.org/doc/Documentation/memory-barriers.txt</a></dd>
1858
1859<dt>ISO/IEC JTC1 SC22 WG21 (C++ standards) 14882 (C++ programming language), chapter 29 (“Atomic operations library”)</dt>
1860<dd>Draft standard for C++ atomic operation features.
1861<br /><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3090.pdf">http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3090.pdf</a>
<br />(intro: <a href="http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf">http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf</a>)</dd>
1863
1864<dt>ISO/IEC JTC1 SC22 WG14 (C standards) 9899 (C programming language) chapter 7.16 (“Atomics &lt;stdatomic.h&gt;”)</dt>
1865<dd>Draft standard for ISO/IEC 9899-201x C atomic operation features. (See also n1484 for errata.)
1866<br /><a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1425.pdf">http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1425.pdf</a></dd>
1867
1868<dt>Dekker’s algorithm</dt>
1869<dd>The “first known correct solution to the mutual exclusion problem in concurrent programming”. The wikipedia article has the full algorithm, with a discussion about how it would need to be updated to work with modern optimizing compilers and SMP hardware.
1870<br /><a href="http://en.wikipedia.org/wiki/Dekker's_algorithm">http://en.wikipedia.org/wiki/Dekker's_algorithm</a></dd>
1871
1872<dt>Comments on ARM vs. Alpha and address dependencies</dt>
1873<dd>An e-mail on the arm-kernel mailing list from Catalin Marinas. Includes a nice summary of address and control dependencies.
1874<br /><a href="http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-05/msg11811.html">http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-05/msg11811.html</a></dd>
1875
1876<dt>What Every Programmer Should Know About Memory</dt>
1877<dd>A very long and detailed article about different types of memory, particularly CPU caches, by Ulrich Drepper.
1878<br /><a href="http://www.akkadia.org/drepper/cpumemory.pdf">http://www.akkadia.org/drepper/cpumemory.pdf</a></dd>
1879
1880<dt>Reasoning about the ARM weakly consistent memory model</dt>
1881<dd>This paper was written by Chong & Ishtiaq of ARM, Ltd. It attempts to describe the ARM SMP memory model in a rigorous but accessible fashion. The definition of “observability” used here comes from this paper.
1882<br /><a href="http://portal.acm.org/ft_gateway.cfm?id=1353528&type=pdf&coll=&dl=&CFID=96099715&CFTOKEN=57505711">http://portal.acm.org/ft_gateway.cfm?id=1353528&type=pdf&coll=&dl=&CFID=96099715&CFTOKEN=57505711</a></dd>
1883
1884<dt>The JSR-133 Cookbook for Compiler Writers</dt>
1885<dd>Doug Lea wrote this as a companion to the JSR-133 (Java Memory Model) documentation. It goes much deeper into the details than most people will need to worry about, but it provides good fodder for contemplation.
1886<br /><a href="http://g.oswego.edu/dl/jmm/cookbook.html">http://g.oswego.edu/dl/jmm/cookbook.html</a></dd>
1887
1888<dt>The Semantics of Power and ARM Multiprocessor Machine Code</dt>
1889<dd>If you prefer your explanations in rigorous mathematical form, this is a fine place to go next.
1890<br /><a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/draft-ppc-arm.pdf">http://www.cl.cam.ac.uk/~pes20/weakmemory/draft-ppc-arm.pdf</a></dd>
1891</dl>