page.title=SMP Primer for Android
page.tags=ndk,native

page.article=true
@jd:body

<div id="tb-wrapper">
<div id="tb">
<h2>In this document</h2>
<ol class="nolist">
  <li><a href="#theory">Theory</a>
    <ol class="nolist">
    <li style="margin:3px 0 0"><a href="#mem_consistency">Memory consistency models</a>
    </li>
    <li style="margin:3px 0 0"><a href="#racefree">Data-race-free programming</a>
      <ol class="nolist">
      <li style="margin:0"><a href="#dataraces">What's a "data race"?</a></li>
      <li style="margin:0"><a href="#avoiding">Avoiding data races</a></li>
      <li style="margin:0"><a href="#reordering">When memory reordering becomes visible</a></li>
      </ol>
    </li>
    </ol>
  </li>
  <li><a href="#practice">Practice</a>
    <ol class="nolist">
    <li style="margin:3px 0 0"><a href="#c_dont">What not to do in C</a>
      <ol class="nolist">
      <li style="margin:0"><a href="#volatile">C/C++ and “volatile”</a></li>
      <li style="margin:0"><a href="#examplesc">Examples</a></li>
      </ol>
    </li>
    <li style="margin:3px 0 0"><a href="#j_dont">What not to do in Java</a>
      <ol class="nolist">
      <li style="margin:0"><a href="#sync_volatile">“synchronized” and “volatile”</a></li>
      <li style="margin:0"><a href="#examplesj">Examples</a></li>
      </ol>
    </li>
    <li style="margin:3px 0 0"><a href="#bestpractice">What to do</a>
    </li>
    </ol>
  </li>
  <li><a href="#weak">A little more about weak memory orders</a>
    <ol class="nolist">
    <li style="margin:0"><a href="#nonracing">Non-racing accesses</a></li>
    <li style="margin:0"><a href="#hint_only">Result is not relied upon for correctness</a></li>
    <li style="margin:0"><a href="#unread">Atomically modified but unread data</a></li>
    <li style="margin:0"><a href="#flag">Simple flag communication</a></li>
    <li style="margin:0"><a href="#immutable">Immutable fields</a></li>
    </ol>
  </li>
  <li><a href="#closing_notes">Closing Notes</a></li>
  <li><a href="#appendix">Appendix</a>
    <ol class="nolist">
    <li style="margin:0"><a href="#sync_stores">Implementing synchronization stores</a></li>
    <li style="margin:0"><a href="#more">Further reading</a></li>
    </ol>
  </li>
</ol>
</div>
</div>

<p>Android 3.0 and later platform versions are optimized to support
multiprocessor architectures. This document introduces issues that
can arise when writing multithreaded code for symmetric multiprocessor systems in C, C++, and the Java
programming language (hereafter referred to simply as “Java” for the sake of
brevity). It's intended as a primer for Android app developers, not as a complete
discussion on the subject.</p>

<h2 id="intro">Introduction</h2>

<p>SMP is an acronym for “Symmetric Multi-Processor”. It describes a design in
which two or more identical CPU cores share access to main memory. Until
a few years ago, all Android devices were UP (Uni-Processor).</p>

<p>Most &mdash; if not all &mdash; Android devices always had multiple CPUs, but
in the past only one of them was used to run applications while the others managed
various bits of device hardware (for example, the radio). The CPUs may have had
different architectures, and the programs running on them couldn’t use main memory
to communicate with each other.</p>

<p>Most Android devices sold today are built around SMP designs,
making things a bit more complicated for software developers. Race conditions
in a multi-threaded program may not cause visible problems on a uniprocessor,
but they may cause regular failures when two or more of your threads
are running simultaneously on different cores.
What’s more, code may be more or less prone to failures when run on different
processor architectures, or even on different implementations of the same
architecture. Code that has been thoroughly tested on x86 may break badly on ARM.
Code may start to fail when recompiled with a more modern compiler.</p>

<p>The rest of this document will explain why, and tell you what you need to do
to ensure that your code behaves correctly.</p>


<h2 id="theory">Memory consistency models: Why SMPs are a bit different</h2>

<p>This is a high-speed, glossy overview of a complex subject. Some areas will
be incomplete, but none of it should be misleading or wrong. As you
will see in the next section, the details here are usually not important.</p>

<p>See <a href="#more">Further reading</a> at the end of the document for
pointers to more thorough treatments of the subject.</p>

<p>Memory consistency models, or often just “memory models”, describe the
guarantees the programming language or hardware architecture
makes about memory accesses. For example,
if you write a value to address A, and then write a value to address B, the
model might guarantee that every CPU core sees those writes happen in that
order.</p>

<p>The model most programmers are accustomed to is <em>sequential
consistency</em>, which is described like this <span
style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Adve &amp;
Gharachorloo</a>)</span>:</p>

<ul>
<li>All memory operations appear to execute one at a time</li>
<li>All operations in a single thread appear to execute in the order described
by that processor's program.</li>
</ul>

<p>Let's assume temporarily that we have a very simple compiler or interpreter
that introduces no surprises: It translates
assignments in the source code to load and store instructions in exactly the
corresponding order, one instruction per access. We'll also assume for
simplicity that each thread executes on its own processor.</p>

<p>If you look at a bit of code and see that it does some reads and writes from
memory, on a sequentially-consistent CPU architecture you know that the code
will do those reads and writes in the expected order. It’s possible that the
CPU is actually reordering instructions and delaying reads and writes, but there
is no way for code running on the device to tell that the CPU is doing anything
other than execute instructions in a straightforward manner. (We’ll ignore
memory-mapped device driver I/O.)</p>

<p>To illustrate these points it’s useful to consider small snippets of code,
commonly referred to as <em>litmus tests</em>.</p>

<p>Here’s a simple example, with code running on two threads:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>A = 3<br />
B = 5</code></td>
<td><code>reg0 = B<br />
reg1 = A</code></td>
</tr>
</table>

<p>In this and all future litmus examples, memory locations are represented by
capital letters (A, B, C) and CPU registers start with “reg”. All memory is
initially zero. Instructions are executed from top to bottom. Here, thread 1
stores the value 3 at location A, and then the value 5 at location B. Thread 2
loads the value from location B into reg0, and then loads the value from
location A into reg1. (Note that we’re writing in one order and reading in
another.)</p>

<p>Thread 1 and thread 2 are assumed to execute on different CPU cores. You
should <strong>always</strong> make this assumption when thinking about
multi-threaded code.</p>

<p>Sequential consistency guarantees that, after both threads have finished
executing, the registers will be in one of the following states:</p>


<table>
<tr>
<th>Registers</th>
<th>States</th>
</tr>
<tr>
<td>reg0=5, reg1=3</td>
<td>possible (thread 1 ran first)</td>
</tr>
<tr>
<td>reg0=0, reg1=0</td>
<td>possible (thread 2 ran first)</td>
</tr>
<tr>
<td>reg0=0, reg1=3</td>
<td>possible (concurrent execution)</td>
</tr>
<tr>
<td>reg0=5, reg1=0</td>
<td>never</td>
</tr>
</table>

<p>To get into a situation where we see B=5 before we see the store to A, either
the reads or the writes would have to happen out of order. On a
sequentially-consistent machine, that can’t happen.</p>

<p>Uni-processors, including x86 and ARM, are normally sequentially consistent.
Threads appear to execute in interleaved fashion, as the OS kernel switches
between them. Most SMP systems, including x86 and ARM,
are not sequentially consistent. For example, it is common for
hardware to buffer stores on their way to memory, so that they
don't immediately reach memory and become visible to other cores.</p>

<p>The details vary substantially. For example, x86, though not sequentially
consistent, still guarantees that reg0 = 5 and reg1 = 0 remains impossible.
Stores are buffered, but their order is maintained.
ARM, on the other hand, does not. The order of buffered stores is not
maintained, and stores may not reach all other cores at the same time.
These differences are important to assembly programmers.
However, as we will see below, C, C++, or Java programmers can
and should program in a way that hides such architectural differences.</p>

<p>So far, we've unrealistically assumed that it is only the hardware that
reorders instructions. In reality, the compiler also reorders instructions to
improve performance. In our example, the compiler might decide that some later
code in Thread 2 needed the value of reg1 before it needed reg0, and thus load
reg1 first. Or some prior code may already have loaded A, and the compiler
might decide to reuse that value instead of loading A again. In either case,
the loads to reg0 and reg1 might be reordered.</p>

<p>Reordering accesses to different memory locations,
either in the hardware, or in the compiler, is
allowed, since it doesn't affect the execution of a single thread, and
it can significantly improve performance. As we will see, with a bit of care,
we can also prevent it from affecting the results of multithreaded programs.</p>

<p>Since compilers can also reorder memory accesses, this problem is actually
not new to SMPs. Even on a uniprocessor, a compiler could reorder the loads to
reg0 and reg1 in our example, and Thread 1 could be scheduled between the
reordered instructions. But if our compiler happened to not reorder, we might
never observe this problem. On most ARM SMPs, even without compiler
reordering, the reordering will probably be seen, possibly after a very large
number of successful executions. Unless you're programming in assembly
language, SMPs generally just make it more likely you'll see problems that were
there all along.</p>

<h2 id="racefree">Data-race-free programming</h2>

<p>Fortunately, there is usually an easy way to avoid thinking about any of
these details. If you follow some straightforward rules, it's usually safe
to forget all of the preceding section except the "sequential consistency" part.
Unfortunately, the other complications may become visible if you
accidentally violate those rules.</p>

<p>Modern programming languages encourage what's known as a "data-race-free"
programming style. So long as you promise not to introduce "data races",
and avoid a handful of constructs that tell the compiler otherwise, the compiler
and hardware promise to provide sequentially consistent results. This doesn't
really mean they avoid memory access reordering. It does mean that if you
follow the rules you won't be able to tell that memory accesses are being
reordered. It's a lot like telling you that sausage is a delicious and
appetizing food, so long as you promise not to visit the
sausage factory. Data races are what expose the ugly truth about memory
reordering.</p>

<h3 id="dataraces">What's a "data race"?</h3>

<p>A <i>data race</i> occurs when at least two threads simultaneously access
the same ordinary data, and at least one of them modifies it. By "ordinary
data" we mean something that's not specifically a synchronization object
intended for thread communication. Mutexes, condition variables, Java
volatiles, or C++ atomic objects are not ordinary data, and their accesses
are allowed to race. In fact they are used to prevent data races on other
objects.</p>

<p>In order to determine whether two threads simultaneously access the same
memory location, we can ignore the memory-reordering discussion from above, and
assume sequential consistency. The following program doesn't have a data race
if <code>A</code> and <code>B</code> are ordinary boolean variables that are
initially false:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>if (A) B = true</code></td>
<td><code>if (B) A = true</code></td>
</tr>
</table>

<p>Since operations are not reordered, both conditions will evaluate to false, and
neither variable is ever updated. Thus there cannot be a data race. There is
no need to think about what might happen if the load from <code>A</code>
and store to <code>B</code> in
Thread 1 were somehow reordered. The compiler is not allowed to reorder Thread
1 by rewriting it as "<code>B = true; if (!A) B = false</code>". That would be
like making sausage in the middle of town in broad daylight.</p>

<p>Data races are officially defined on basic built-in types like integers and
references or pointers. Assigning to an <code>int</code> while simultaneously
reading it in another thread is clearly a data race. But both the C++
standard library and
the Java Collections libraries are written to allow you to also reason about
data races at the library level. They promise to not introduce data races
unless there are concurrent accesses to the same container, at least one of
which updates it. Updating a <code>set&lt;T&gt;</code> in one thread while
simultaneously reading it in another allows the library to introduce a
data race, and can thus be thought of informally as a "library-level data race".
Conversely, updating one <code>set&lt;T&gt;</code> in one thread, while reading
a different one in another, does not result in a data race, because the
library promises not to introduce a (low-level) data race in that case.</p>

<p>Normally concurrent accesses to different fields in a data structure
cannot introduce a data race. However there is one important exception to
this rule: Contiguous sequences of bit-fields in C or C++ are treated as
a single "memory location". Accessing any bit-field in such a sequence
is treated as accessing all of them for purposes of determining the
existence of a data race. This reflects the inability of common hardware
to update individual bits without also reading and re-writing adjacent bits.
Java programmers have no analogous concerns.</p>

<h3 id="avoiding">Avoiding data races</h3>

Modern programming languages provide a number of synchronization
mechanisms to avoid data races. The most basic tools are:

<dl>
<dt>Locks or Mutexes</dt>

<dd>Mutexes (C++11 <code>std::mutex</code>, or <code>pthread_mutex_t</code>), or
<code>synchronized</code> blocks in Java can be used to ensure that certain
sections of code do not run concurrently with other sections of code accessing
the same data. We'll refer to these and other similar facilities generically
as "locks." Consistently acquiring a specific lock before accessing a shared
data structure, and releasing it afterwards, prevents data races when accessing
the data structure. It also ensures that updates and accesses are atomic, i.e. no
other update to the data structure can run in the middle. This is deservedly
by far the most common tool for preventing data races. (A minimal C++ sketch
follows this list.) The use of Java
<code>synchronized</code> blocks or C++ <code>lock_guard</code>
or <code>unique_lock</code> ensures that locks are properly released in the
event of an exception.
</dd>

<dt>Volatile/atomic variables</dt>

<dd>Java provides <code>volatile</code> fields that support concurrent access
without introducing data races. Since 2011, C and C++ support
<code>atomic</code> variables and fields with similar semantics. These are
typically more difficult to use than locks, since they only ensure that
individual accesses to a single variable are atomic. (In C++ this normally
extends to simple read-modify-write operations, like increments. Java
requires special method calls for that.)
Unlike locks, <code>volatile</code> or <code>atomic</code> variables can't
be used directly to prevent other threads from interfering with longer code sequences.</dd>

</dl>
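
<p>As a minimal sketch of the lock-based approach (the names here are
illustrative, not taken from any particular API), every access to the shared
data happens while the same mutex is held, so there is no data race:</p>

<pre>#include &lt;mutex&gt;

static std::mutex gCountMutex;
static int gCount = 0;    // only accessed while gCountMutex is held

void incrementCount() {
    std::lock_guard&lt;std::mutex&gt; lock(gCountMutex);  // released automatically, even on exceptions
    ++gCount;
}

int readCount() {
    std::lock_guard&lt;std::mutex&gt; lock(gCountMutex);
    return gCount;
}</pre>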

<p>It's important to note that <code>volatile</code> has very different
meanings in C++ and Java. In C++, <code>volatile</code> does not prevent data
races, though older code often uses it as a workaround for the lack of
<code>atomic</code> objects. This is no longer recommended; in
C++, use <code>atomic&lt;T&gt;</code> for variables that can be concurrently
accessed by multiple threads. C++ <code>volatile</code> is meant for
device registers and the like.</p>

<p>C/C++ <code>atomic</code> variables or Java <code>volatile</code> variables
can be used to prevent data races on other variables. If <code>flag</code> is
declared to have type <code>atomic&lt;bool&gt;</code>
or <code>atomic_bool</code> (C/C++) or <code>volatile boolean</code> (Java),
and is initially false, then the following snippet is data-race-free:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>A = ...<br />
&nbsp;&nbsp;flag = true</code>
</td>
<td><code>while (!flag) {}<br />
... = A</code>
</td>
</tr>
</table>

<p>Since Thread 2 waits for <code>flag</code> to be set, the access to
<code>A</code> in Thread 2 must happen after, and not concurrently with, the
assignment to <code>A</code> in Thread 1. Thus there is no data race on
<code>A</code>. The race on <code>flag</code> doesn't count as a data race,
since volatile/atomic accesses are not "ordinary memory accesses".</p>
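
<p>Expressed as C++11 code rather than as a litmus table, the same pattern
looks roughly like this (a sketch; the names are illustrative):</p>

<pre>#include &lt;atomic&gt;

int A = 0;                          // ordinary data
std::atomic&lt;bool&gt; flag(false);      // synchronization object

void thread1() {
    A = 42;
    flag.store(true);               // sequentially consistent by default
}

void thread2() {
    while (!flag.load()) {}         // spin until the flag is set
    int value = A;                  // guaranteed to observe 42; no data race
}</pre>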

<p>The implementation is required to prevent or hide memory reordering
sufficiently to make code like the preceding litmus test behave as expected.
This normally makes volatile/atomic memory accesses
substantially more expensive than ordinary accesses.</p>

<p>Although the preceding example is data-race-free, locks together with
<code>Object.wait()</code> in Java or condition variables in C/C++ usually
provide a better solution that does not involve waiting in a loop while
draining battery power.</p>

<h3 id="reordering">When memory reordering becomes visible</h3>

Data-race-free programming normally saves us from having to explicitly deal
with memory access reordering issues. However, there are several cases in
which reordering does become visible:

<ol>
<li> If your program has a bug resulting in an unintentional data race,
compiler and hardware transformations can become visible, and the behavior
of your program may be surprising. For example, if we forgot to declare
<code>flag</code> volatile in the preceding example, Thread 2 may see an
uninitialized <code>A</code>. Or the compiler may decide that <code>flag</code>
can't possibly change during Thread 2's loop and transform the program to

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>A = ...<br />
&nbsp;&nbsp;flag = true</code>
</td>
<td><code>reg0 = flag<br />
while (!reg0) {}<br />
... = A</code>
</td>
</tr>
</table>

When you debug, you may well see the loop continuing forever in spite of
the fact that <code>flag</code> is true.</li>

<li> C++ provides facilities for explicitly relaxing
sequential consistency even if there are no races. Atomic operations
can take explicit <code>memory_order_</code>... arguments. Similarly, the
<code>java.util.concurrent.atomic</code> package provides a more restricted
set of similar facilities, notably <code>lazySet()</code>. And Java
programmers occasionally use intentional data races for similar effect.
All of these provide performance improvements at a large
cost in programming complexity. We discuss them only briefly
<a href="#weak">below</a>.</li>

<li> Some C and C++ code is written in an older style, not entirely
consistent with current language standards, in which <code>volatile</code>
variables are used instead of <code>atomic</code> ones, and memory reordering
is explicitly prevented by inserting so-called <i>fences</i> or
<i>barriers</i>. This requires explicit reasoning about access
reordering and understanding of hardware memory models. A coding style
along these lines is still used in the Linux kernel. It should not
be used in new Android applications, and is also not further discussed here.
</li>
</ol>

<h2 id="practice">Practice</h2>

<p>Debugging memory consistency problems can be very difficult. If a missing
lock, <code>atomic</code> or <code>volatile</code> declaration causes
some code to read stale data, you may not be able to
figure out why by examining memory dumps with a debugger. By the time you can
issue a debugger query, the CPU cores may have all observed the full set of
accesses, and the contents of memory and the CPU registers will appear to be in
an “impossible” state.</p>

<h3 id="c_dont">What not to do in C</h3>

<p>Here we present some examples of incorrect code, along with simple ways to
fix them. Before we do that, we need to discuss the use of a basic language
feature.</p>

<h4 id="volatile">C/C++ and "volatile"</h4>

<p>C and C++ <code>volatile</code> declarations are a very special purpose tool.
They prevent <i>the compiler</i> from reordering or removing <i>volatile</i>
accesses. This can be helpful for code accessing hardware device registers,
memory mapped to more than one location, or in connection with
<code>setjmp</code>. But C and C++ <code>volatile</code>, unlike Java
<code>volatile</code>, is not designed for thread communication.</p>

<p>In C and C++, accesses to <code>volatile</code>
data may be reordered with accesses to non-volatile data, and there are no
atomicity guarantees. Thus <code>volatile</code> can't be used for sharing data between
threads in portable code, even on a uniprocessor. C <code>volatile</code> usually does not
prevent access reordering by the hardware, so by itself it is even less useful in
multi-threaded SMP environments. This is the reason C11 and C++11 support
<code>atomic</code> objects. You should use those instead.</p>

<p>A lot of older C and C++ code still abuses <code>volatile</code> for thread
communication. This often works correctly for data that fits
in a machine register, provided it is used with either explicit fences or in cases
in which memory ordering is not important. But it is not guaranteed to work
correctly with future compilers.</p>


<h4 id="examplesc">Examples</h4>

<p>In most cases you’d be better off with a lock (like a
<code>pthread_mutex_t</code> or C++11 <code>std::mutex</code>) rather than an
atomic operation, but we will employ the latter to illustrate how they would be
used in a practical situation.</p>


<pre>MyThing* gGlobalThing = NULL;  // Wrong!  See below.

void initGlobalThing()    // runs in Thread 1
{
    MyThing* thing = malloc(sizeof(*thing));
    memset(thing, 0, sizeof(*thing));
    thing-&gt;x = 5;
    thing-&gt;y = 10;
    /* initialization complete, publish */
    gGlobalThing = thing;
}

void useGlobalThing()    // runs in Thread 2
{
    if (gGlobalThing != NULL) {
        int i = gGlobalThing-&gt;x;    // could be 5, 0, or uninitialized data
        ...
    }
}</pre>

<p>The idea here is that we allocate a structure, initialize its fields, and at
the very end we “publish” it by storing it in a global variable. At that point,
any other thread can see it, but that’s fine since it’s fully initialized,
right?</p>

<p>The problem is that the store to <code>gGlobalThing</code> could be observed
before the fields are initialized, typically because either the compiler or the
processor reordered the stores to <code>gGlobalThing</code> and
<code>thing-&gt;x</code>. Another thread reading from <code>thing-&gt;x</code> could
see 5, 0, or even uninitialized data.</p>

<p>The core problem here is a data race on <code>gGlobalThing</code>.
If Thread 1 calls <code>initGlobalThing()</code> while Thread 2
calls <code>useGlobalThing()</code>, <code>gGlobalThing</code> can be
read while being written.</p>

<p>This can be fixed by declaring <code>gGlobalThing</code> as
atomic. In C++11:</p>

<pre>atomic&lt;MyThing*&gt; gGlobalThing(NULL);</pre>

<p>This ensures that the writes will become visible to other threads
in the proper order. It also guarantees to prevent some other failure
modes that are otherwise allowed, but unlikely to occur on real
Android hardware. For example, it ensures that we cannot see a
<code>gGlobalThing</code> pointer that has only been partially written.</p>
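
<p>Put together, a corrected C++11 version of the publishing code might look
like the following sketch (it reuses the <code>MyThing</code> name from above;
the details are illustrative):</p>

<pre>#include &lt;atomic&gt;

struct MyThing {
    int x;
    int y;
};

std::atomic&lt;MyThing*&gt; gGlobalThing(nullptr);

void initGlobalThing()    // runs in Thread 1
{
    MyThing* thing = new MyThing();
    thing-&gt;x = 5;
    thing-&gt;y = 10;
    /* initialization complete, publish */
    gGlobalThing.store(thing);    // sequentially consistent by default
}

void useGlobalThing()    // runs in Thread 2
{
    MyThing* thing = gGlobalThing.load();
    if (thing != nullptr) {
        int i = thing-&gt;x;    // now guaranteed to be 5
    }
}</pre>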

<h3 id="j_dont">What not to do in Java</h3>

<p>We haven’t discussed some relevant Java language features, so we’ll take a
quick look at those first.</p>

<p>Java technically does not require code to be data-race-free. And there
is a small amount of very-carefully-written Java code that works correctly
in the presence of data races. However, writing such code is extremely
tricky, and we discuss it only briefly below. To make matters
worse, the experts who specified the meaning of such code no longer believe the
specification is correct. (The specification is fine for data-race-free
code.)</p>

<p>For now we will adhere to the data-race-free model, for which Java provides
essentially the same guarantees as C and C++. Again, the language provides
some primitives that explicitly relax sequential consistency, notably the
<code>lazySet()</code> and <code>weakCompareAndSet()</code> calls
in <code>java.util.concurrent.atomic</code>.
As with C and C++, we will ignore these for now.</p>

<h4 id="sync_volatile">Java's "synchronized" and "volatile" keywords</h4>

<p>The “synchronized” keyword provides the Java language’s built-in locking
mechanism. Every object has an associated “monitor” that can be used to provide
mutually exclusive access. If two threads try to "synchronize" on the
same object, one of them will wait until the other completes.</p>

<p>As we mentioned above, Java's <code>volatile T</code> is the analog of
C++11's <code>atomic&lt;T&gt;</code>. Concurrent accesses to
<code>volatile</code> fields are allowed, and don't result in data races.
Ignoring <code>lazySet()</code> et al. and data races, it is the Java VM's job to
make sure that the result still appears sequentially consistent.</p>

<p>In particular, if thread 1 writes to a <code>volatile</code> field, and
thread 2 subsequently reads from that same field and sees the newly written
value, then thread 2 is also guaranteed to see all writes previously made by
thread 1. In terms of memory effect, writing to
a volatile is analogous to a monitor release, and
reading from a volatile is like a monitor acquire.</p>

<p>There is one notable difference from C++'s <code>atomic</code>:
If we write <code>volatile int x;</code>
in Java, then <code>x++</code> is the same as <code>x = x + 1</code>; it
performs an atomic load, increments the result, and then performs an atomic
store. Unlike C++, the increment as a whole is not atomic.
Atomic increment operations are instead provided by
the <code>java.util.concurrent.atomic</code> package.</p>

<h4 id="examplesj">Examples</h4>

<p>Here’s a simple, <em>incorrect</em> implementation of a monotonic counter: <span
style="font-size:.9em;color:#777">(<em><a href="#more" style="color:#777">Java
theory and practice: Managing volatility</a></em>)</span>.</p>

<pre>class Counter {
    private int mValue;

    public int get() {
        return mValue;
    }
    public void incr() {
        mValue++;
    }
}</pre>

<p>Assume <code>get()</code> and <code>incr()</code> are called from multiple
threads, and we want to be sure that every thread sees the current count when
<code>get()</code> is called. The most glaring problem is that
<code>mValue++</code> is actually three operations:</p>

<ol>
<li><code>reg = mValue</code></li>
<li><code>reg = reg + 1</code></li>
<li><code>mValue = reg</code></li>
</ol>

<p>If two threads execute in <code>incr()</code> simultaneously, one of the
updates could be lost. To make the increment atomic, we need to declare
<code>incr()</code> “synchronized”.</p>

<p>It’s still broken, however, especially on SMP. There is still a data race,
in that <code>get()</code> can access <code>mValue</code> concurrently with
<code>incr()</code>. Under Java rules, the <code>get()</code> call can
appear to be reordered with respect to other code. For example, if we read two
counters in a row, the results might appear to be inconsistent
because the <code>get()</code> calls were reordered, either by the hardware or
compiler. We can correct the problem by declaring <code>get()</code> to be
synchronized. With this change, the code is obviously correct.</p>

<p>Unfortunately, we’ve introduced the possibility of lock contention, which
could hamper performance. Instead of declaring <code>get()</code> to be
synchronized, we could declare <code>mValue</code> with “volatile”. (Note that
<code>incr()</code> must still use <code>synchronized</code> since
<code>mValue++</code> is otherwise not a single atomic operation.)
This also avoids all data races, so sequential consistency is preserved.
<code>incr()</code> will be somewhat slower, since it incurs both monitor entry/exit
overhead, and the overhead associated with a volatile store, but
<code>get()</code> will be faster, so even in the absence of contention this is
a win if reads greatly outnumber writes. (See also {@link
java.util.concurrent.atomic.AtomicInteger} for a way to completely
remove the synchronized block.)</p>

<p>Here’s another example, similar in form to the earlier C examples:</p>

<pre>class MyGoodies {
    public int x, y;
}
class MyClass {
    static MyGoodies sGoodies;

    void initGoodies() {    // runs in thread 1
        MyGoodies goods = new MyGoodies();
        goods.x = 5;
        goods.y = 10;
        sGoodies = goods;
    }

    void useGoodies() {    // runs in thread 2
        if (sGoodies != null) {
            int i = sGoodies.x;    // could be 5 or 0
            ....
        }
    }
}</pre>

<p>This has the same problem as the C code, namely that there is
a data race on <code>sGoodies</code>. Thus the assignment
<code>sGoodies = goods</code> might be observed before the initialization of the
fields in <code>goods</code>. If you declare <code>sGoodies</code> with the
<code>volatile</code> keyword, sequential consistency is restored, and things will work
as expected.</p>

<p>Note that only the <code>sGoodies</code> reference itself is volatile. The
accesses to the fields inside it are not. Once <code>sGoodies</code> is
<code>volatile</code>, and memory ordering is properly preserved, the fields
cannot be concurrently accessed. The statement <code>z =
sGoodies.x</code> will perform a volatile load of <code>MyClass.sGoodies</code>
followed by a non-volatile load of <code>sGoodies.x</code>. If you make a local
reference <code>MyGoodies localGoods = sGoodies</code>, then a subsequent <code>z =
localGoods.x</code> will not perform any volatile loads.</p>

<p>A more common idiom in Java programming is the infamous “double-checked
locking”:</p>

<pre>class MyClass {
    private Helper helper = null;

    public Helper getHelper() {
        if (helper == null) {
            synchronized (this) {
                if (helper == null) {
                    helper = new Helper();
                }
            }
        }
        return helper;
    }
}</pre>

<p>The idea is that we want to have a single instance of a <code>Helper</code>
object associated with an instance of <code>MyClass</code>. We must only create
it once, so we create and return it through a dedicated <code>getHelper()</code>
function. To avoid a race in which two threads create the instance, we need to
synchronize the object creation. However, we don’t want to pay the overhead for
the “synchronized” block on every call, so we only do that part if
<code>helper</code> is currently null.</p>
710
Hans Boehm9e629ad2015-09-14 13:50:00 -0700711<p>This has a data race on the <code>helper</code> field. It can be
712set concurrently with the <code>helper == null</code> in another thread.
713</p>
Dirk Doughertyd5894212012-11-28 18:53:10 -0800714
Hans Boehm9e629ad2015-09-14 13:50:00 -0700715<p>To see how this can fail, consider
Dirk Doughertyd5894212012-11-28 18:53:10 -0800716the same code rewritten slightly, as if it were compiled into a C-like language
717(I’ve added a couple of integer fields to represent <code>Helper’s</code>
718constructor activity):</p>
719
720<pre>if (helper == null) {
Hans Boehm9e629ad2015-09-14 13:50:00 -0700721 synchronized() {
722 if (helper == null) {
723 newHelper = malloc(sizeof(Helper));
724 newHelper-&gt;x = 5;
725 newHelper-&gt;y = 10;
726 helper = newHelper;
727 }
Dirk Doughertyd5894212012-11-28 18:53:10 -0800728 }
Hans Boehm9e629ad2015-09-14 13:50:00 -0700729 return helper;
Dirk Doughertyd5894212012-11-28 18:53:10 -0800730}</pre>
731
Hans Boehm9e629ad2015-09-14 13:50:00 -0700732<p>There is nothing to prevent either the hardware or the compiler
733from reordering the store to <code>helper</code> with those to the
734<code>x</code>/<code>y</code> fields. Another thread could find
735<code>helper</code> non-null but its fields not yet set and ready to use.
736For more details and more failure modes, see the “‘Double Checked
737Locking is Broken’ Declaration” link in the appendix for more details, or Item
73871 (“Use lazy initialization judiciously”) in Josh Bloch’s <em>Effective Java,
7392nd Edition.</em>.</p>
Dirk Doughertyd5894212012-11-28 18:53:10 -0800740
741<p>There are two ways to fix this:</p>
742<ol>
743<li>Do the simple thing and delete the outer check. This ensures that we never
744examine the value of <code>helper</code> outside a synchronized block.</li>
745<li>Declare <code>helper</code> volatile. With this one small change, the code
746in Example J-3 will work correctly on Java 1.5 and later. (You may want to take
747a minute to convince yourself that this is true.)</li>
748</ol>

<p>Here is another illustration of <code>volatile</code> behavior:</p>

<pre>class MyClass {
    int data1, data2;
    volatile int vol1, vol2;

    void setValues() {    // runs in Thread 1
        data1 = 1;
        vol1 = 2;
        data2 = 3;
    }

    void useValues() {    // runs in Thread 2
        if (vol1 == 2) {
            int l1 = data1;    // okay
            int l2 = data2;    // wrong
        }
    }
}</pre>

<p>Looking at <code>useValues()</code>, if Thread 2 hasn’t yet observed the
update to <code>vol1</code>, then it can’t know if <code>data1</code> or
<code>data2</code> has been set yet. Once it sees the update to
<code>vol1</code>, it knows that <code>data1</code> can be safely accessed
and correctly read without introducing a data race. However,
it can’t make any assumptions about <code>data2</code>, because that store was
performed after the volatile store.</p>

<p>Note that <code>volatile</code> cannot be used to prevent reordering
of other memory accesses that race with each other. It is not guaranteed to
generate a machine memory fence instruction. It can be used to prevent
data races by executing code only when another thread has satisfied a
certain condition.</p>

<h3 id="bestpractice">What to do</h3>

<p>In C/C++, prefer the C++11
synchronization classes, such as <code>std::mutex</code>. If they are not available, use
the corresponding <code>pthread</code> operations.
These include the proper memory fences, providing correct (sequentially consistent
unless otherwise specified)
and efficient behavior on all Android platform versions. Be sure to use them
correctly. For example, remember that condition variable waits may spuriously
return without being signaled, and should thus appear in a loop.</p>
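
<p>For instance, a C++11 condition-variable wait is normally written with the
wait inside a predicate loop, along these lines (a sketch; the names are
illustrative):</p>

<pre>#include &lt;condition_variable&gt;
#include &lt;mutex&gt;

std::mutex gMutex;
std::condition_variable gCond;
bool gReady = false;                // protected by gMutex

void producer() {
    {
        std::lock_guard&lt;std::mutex&gt; lock(gMutex);
        gReady = true;
    }
    gCond.notify_one();
}

void consumer() {
    std::unique_lock&lt;std::mutex&gt; lock(gMutex);
    while (!gReady) {               // the loop guards against spurious wakeups
        gCond.wait(lock);
    }
    // gReady is true and gMutex is held here
}</pre>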

<p>It's best to avoid using atomic functions directly, unless the data structure
you are implementing is extremely simple, like a counter. Locking and
unlocking a pthread mutex require a single atomic operation each,
and often cost less than a single cache miss, if there’s no
contention, so you’re not going to save much by replacing mutex calls with
atomic ops. Lock-free designs for non-trivial data structures require
much more care to ensure that higher level operations on the data structure
appear atomic (as a whole, not just their explicitly atomic pieces).</p>

<p>If you do use atomic operations, relaxing ordering with
<code>memory_order</code>... or <code>lazySet()</code> may provide performance
advantages, but requires deeper understanding than we have conveyed so far.
A large fraction of existing code using
these is discovered to have bugs after the fact. Avoid these if possible.
If your use case doesn't exactly fit one of those in the next section,
make sure you either are an expert, or have consulted one.</p>

<p>Avoid using <code>volatile</code> for thread communication in C/C++.</p>

<p>In Java, concurrency problems are often best solved by
using an appropriate utility class from
the {@link java.util.concurrent} package. The code is well written and well
tested on SMP.</p>

<p>Perhaps the safest thing you can do is make your objects immutable. Objects
from classes like Java's String and Integer hold data that cannot be changed once an
object is created, avoiding all potential for data races on those objects.
The book <em>Effective
Java, 2nd Ed.</em> has specific instructions in “Item 15: Minimize Mutability”. Note in
particular the importance of declaring Java fields “final” <span
style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Bloch</a>)</span>.</p>

<p>Even if an object is immutable, remember that communicating it to another
thread without any kind of synchronization is a data race. This can occasionally
be acceptable in Java (see below), but requires great care, and is likely to result in
brittle code. If it's not extremely performance critical, add a
<code>volatile</code> declaration. In C++, communicating a pointer or
reference to an immutable object without proper synchronization,
like any data race, is a bug.
In this case, it is reasonably likely to result in intermittent crashes since,
for example, the receiving thread may see an uninitialized method table
pointer due to store reordering.</p>

<p>If neither an existing library class, nor an immutable class is
appropriate, the Java <code>synchronized</code> statement or C++
<code>lock_guard</code> / <code>unique_lock</code> should be used to guard
accesses to any field that can be accessed by more than one thread. If mutexes won’t
work for your situation, you should declare shared fields
<code>volatile</code> or <code>atomic</code>, but you must take great care to
understand the interactions between threads. These declarations won’t
save you from common concurrent programming mistakes, but they will help you
avoid the mysterious failures associated with optimizing compilers and SMP
mishaps.</p>

<p>You should avoid
"publishing" a reference to an object, i.e. making it available to other
threads, in its constructor. This is less critical in C++ or if you stick to
our "no data races" advice in Java. But it's always good advice, and becomes
critical if your Java code is
run in other contexts in which the Java security model matters, and untrusted
code may introduce a data race by accessing that "leaked" object reference.
It's also critical if you choose to ignore our warnings and use some of the techniques
in the next section.
See <span style="font-size:.9em;color:#777">(<a href="#more"
style="color:#777">Safe Construction Techniques in Java</a>)</span> for
details.</p>

<h2 id="weak">A little more about weak memory orders</h2>

<p>C++11 and later provide explicit mechanisms for relaxing the sequential
consistency guarantees for data-race-free programs. Explicit
<code>memory_order_relaxed</code>, <code>memory_order_acquire</code> (loads
only), and <code>memory_order_release</code> (stores only) arguments for atomic
operations each provide strictly weaker guarantees than the default, typically
implicit, <code>memory_order_seq_cst</code>. <code>memory_order_acq_rel</code>
provides both <code>memory_order_acquire</code> and
<code>memory_order_release</code> guarantees for atomic read-modify-write
operations. <code>memory_order_consume</code> is not yet sufficiently
well specified or implemented to be useful, and should be ignored for now.</p>

<p>The <code>lazySet</code> methods in <code>java.util.concurrent.atomic</code>
are similar to C++ <code>memory_order_release</code> stores. Java's
ordinary variables are sometimes used as a replacement for
<code>memory_order_relaxed</code> accesses, though they are actually
even weaker. Unlike C++, there is no real mechanism for unordered
accesses to variables that are declared as <code>volatile</code>.</p>

<p>You should generally avoid these unless there are pressing performance reasons to
use them. On weakly ordered machine architectures like ARM, using them will
commonly save on the order of a few dozen machine cycles for each atomic operation.
On x86, the performance win is limited to stores, and likely to be less
noticeable.
Somewhat counter-intuitively, the benefit may decrease with larger core counts,
as the memory system becomes more of a limiting factor.</p>

<p>The full semantics of weakly ordered atomics are complicated.
In general they require
precise understanding of the language rules, which we will
not go into here. For example:</p>

<ul>
<li> The compiler or hardware can move <code>memory_order_relaxed</code>
accesses into (but not out of) a critical section bounded by a lock
acquisition and release. This means that two
<code>memory_order_relaxed</code> stores may become visible out of order,
even if they are separated by a critical section.</li>
<li> An ordinary Java variable, when abused as a shared counter, may appear
to another thread to decrease, even though it is only incremented by a single
other thread. But this is not true for C++ atomic
<code>memory_order_relaxed</code>.</li>
</ul>

<p>With that as a warning,
here we give a small number of idioms that seem to cover many of the use
cases for weakly ordered atomics. Many of these are applicable only to C++:</p>

<h3 id="nonracing">Non-racing accesses</h3>
<p>It is fairly common that a variable is atomic because it is <em>sometimes</em>
read concurrently with a write, but not all accesses have this issue.
For example a variable
may need to be atomic because it is read outside a critical section, but all
updates are protected by a lock. In that case, a read that happens to be
protected by the same lock
cannot race, since there cannot be concurrent writes. In such a case, the
non-racing access (the load, in this case) can be annotated with
<code>memory_order_relaxed</code> without changing the correctness of C++ code.
The lock implementation already enforces the required memory ordering
with respect to access by other threads, and <code>memory_order_relaxed</code>
specifies that essentially no additional ordering constraints need to be
enforced for the atomic access.</p>
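
<p>A C++ sketch of this situation (the names are illustrative; the store keeps
the default ordering because it still races with the unlocked reader):</p>

<pre>#include &lt;atomic&gt;
#include &lt;mutex&gt;

std::mutex gMutex;
std::atomic&lt;int&gt; gValue(0);    // atomic because it is also read without the lock

void setValue(int v) {
    std::lock_guard&lt;std::mutex&gt; lock(gMutex);   // all updates hold gMutex
    gValue.store(v);                            // default: memory_order_seq_cst
}

int getValueLocked() {
    std::lock_guard&lt;std::mutex&gt; lock(gMutex);
    // No update can run concurrently with this load, so it cannot race;
    // memory_order_relaxed does not change the correctness of this code.
    return gValue.load(std::memory_order_relaxed);
}

int getValueUnlocked() {
    return gValue.load();    // racing read; keep the default ordering
}</pre>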

<p>There is no real analog to this in Java.</p>

<h3 id="hint_only">Result is not relied upon for correctness</h3>

<p>When we use a racing load only to generate a hint, it's generally also OK
to not enforce any memory ordering for the load. If the value is
not reliable, we also can't reliably use the result to infer anything about
other variables. Thus it's OK
if memory ordering is not guaranteed, and the load is
supplied with a <code>memory_order_relaxed</code> argument.</p>

<p>A common
instance of this is the use of C++ <code>compare_exchange</code>
to atomically replace <code>x</code> by <code>f(x)</code>.
The initial load of <code>x</code> to compute <code>f(x)</code>
does not need to be reliable. If we get it wrong, the
<code>compare_exchange</code> will fail and we will retry.
It is fine for the initial load of <code>x</code> to use
a <code>memory_order_relaxed</code> argument; only memory ordering
for the actual <code>compare_exchange</code> matters.</p>
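
<p>A sketch of that idiom in C++ (<code>f</code> here stands for an arbitrary
pure function; the rest of the names are illustrative):</p>

<pre>#include &lt;atomic&gt;

int f(int v) { return 2 * v + 1; }    // any pure function of the old value

void updateAtomically(std::atomic&lt;int&gt;&amp; x) {
    // The initial load only produces a guess; if it is stale, the
    // compare_exchange fails and refreshes old_value with the current value.
    int old_value = x.load(std::memory_order_relaxed);
    while (!x.compare_exchange_weak(old_value, f(old_value))) {
        // old_value now holds the value actually observed; recompute f and retry.
    }
}</pre>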

<h3 id="unread">Atomically modified but unread data</h3>

<p>Occasionally data is modified in parallel by multiple threads, but
not examined until the parallel computation is complete. A good
example of this is a counter that is atomically incremented (e.g.
using <code>fetch_add()</code> in C++ or
<code>atomic_fetch_add_explicit()</code>
in C) by multiple threads in parallel, but the result of these calls
is always ignored. The resulting value is only read at the end,
after all updates are complete.</p>

<p>In this case, there is no way to tell whether accesses to this data
were reordered, and hence C++ code may use a <code>memory_order_relaxed</code>
argument.</p>
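
<p>For instance, such a counter might be bumped like this in C++ (a sketch;
the names are illustrative):</p>

<pre>#include &lt;atomic&gt;

std::atomic&lt;long&gt; gEventCount(0);

void recordEvent() {
    // The return value is ignored and the total is only examined after all
    // threads have finished, so relaxed ordering is sufficient here.
    gEventCount.fetch_add(1, std::memory_order_relaxed);
}</pre>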

<p>Simple event counters are a common example of this. Since it is
so common, it is worth making some observations about this case:</p>

<ul>
<li> Use of <code>memory_order_relaxed</code> improves performance,
but may not address the most important performance issue: Every update
requires exclusive access to the cache line holding the counter. This
results in a cache miss every time a new thread accesses the counter.
If updates are frequent and alternate between threads, it is much faster
to avoid updating the shared counter every time by,
for example, using thread-local counters and summing them at the end.</li>
<li> This technique can be combined with the previous section: It is possible to
concurrently read approximate and unreliable values while they are being updated,
with all operations using <code>memory_order_relaxed</code>.
But it is important to treat the resulting values as completely unreliable.
Just because the count appears to have been incremented once does not
mean another thread can be counted on to have reached the point
at which the increment has been performed. The increment may instead have
been reordered with earlier code. (As for the similar case we mentioned
earlier, C++ does guarantee that a second load of such a counter will not
return a value less than an earlier load in the same thread. Unless of
course the counter overflowed.)</li>
<li> It is common to find code that tries to compute approximate
counter values by performing individual atomic (or not) reads and writes, but
not making the increment as a whole atomic. The usual argument is that
this is "close enough" for performance counters or the like.
It's typically not.
When updates are sufficiently frequent (a case
you probably care about), a large fraction of the counts are typically
lost. On a quad core device, more than half the counts may commonly be lost.
(Easy exercise: construct a two thread scenario in which the counter is
updated a million times, but the final counter value is one.)</li>
</ul>

<h3 id="flag">Simple flag communication</h3>

<p>A <code>memory_order_release</code> store (or read-modify-write operation)
ensures that if subsequently a <code>memory_order_acquire</code> load
(or read-modify-write operation) reads the written value, then it will
also observe any stores (ordinary or atomic) that preceded the
<code>memory_order_release</code> store. Conversely, any loads
preceding the <code>memory_order_release</code> will not observe any
stores that followed the <code>memory_order_acquire</code> load.
Unlike <code>memory_order_relaxed</code>, this allows such atomic operations
to be used to communicate the progress of one thread to another.</p>
1007
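<p>A minimal sketch of such flag-based communication follows; the names
and the busy-wait loop are illustrative assumptions, not a recommended
production idiom:</p>

<pre>
#include &lt;atomic&gt;

int payload;                          // ordinary, non-atomic data
std::atomic&lt;bool&gt; ready{false};

void producer() {
  payload = 42;                       // plain store
  // Release: anyone who sees ready == true also sees the payload store.
  ready.store(true, std::memory_order_release);
}

void consumer() {
  // Acquire pairs with the release store above.
  while (!ready.load(std::memory_order_acquire)) {
    // spin; a real program would block or yield instead
  }
  int v = payload;                    // no data race; v is guaranteed to be 42
  (void) v;
}
</pre>

<p>If the consumer's acquire load observes <code>true</code>, the
release/acquire pairing guarantees that it also observes the earlier
store to <code>payload</code>.</p>
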
1008<p>For example, we can rewrite the double-checked locking example
1009from above in C++ as</p>
1010
1011<pre>
class MyClass {
  private:
    atomic&lt;Helper*&gt; helper {nullptr};
    mutex mtx;

  public:
    Helper* getHelper() {
      // Acquire: if this load sees a non-null pointer, it also sees the
      // initialization performed before the matching release store.
      Helper* myHelper = helper.load(memory_order_acquire);
      if (myHelper == nullptr) {
        lock_guard&lt;mutex&gt; lg(mtx);
        // This re-load cannot race: the only store is made with mtx held,
        // so relaxed ordering suffices.
        myHelper = helper.load(memory_order_relaxed);
        if (myHelper == nullptr) {
          myHelper = new Helper();
          // Release: publish the fully constructed Helper.
          helper.store(myHelper, memory_order_release);
        }
      }
      return myHelper;
    }
};
1031</pre>
1032
1033<p>The acquire load and release store ensure that if we see a non-null
1034<code>helper</code>, then we will also see its fields correctly initialized.
1035We've also incorporated the prior observation that non-racing loads
1036can use <code>memory_order_relaxed</code>.</p>
1037
1038<p>A Java programmer could conceivably represent <code>helper</code> as a
1039<code>java.util.concurrent.atomic.AtomicReference&lt;Helper&gt;</code>
1040and use <code>lazySet()</code> as the release store. The load
1041operations would continue to use plain <code>get()</code> calls.</p>
1042
1043<p>In both cases, our performance tweaking concentrated on the initialization
1044path, which is unlikely to be performance critical.
1045A more readable compromise might be:</p>
1046
1047<pre>
  Helper* getHelper() {
    Helper* myHelper = helper.load(memory_order_acquire);
    if (myHelper != nullptr) {
      return myHelper;
    }
    lock_guard&lt;mutex&gt; lg(mtx);
    // Default (sequentially consistent) operations on the slow path.
    if (helper == nullptr) {
      helper = new Helper();
    }
    return helper;
  }
1059</pre>
1060
1061<p>This provides the same fast path, but resorts to default,
1062sequentially-consistent, operations on the non-performance-critical slow
1063path.</p>
1064
1065<p>Even here, <code>helper.load(memory_order_acquire)</code> is
1066likely to generate the same code on current Android-supported
1067architectures as a plain (sequentially-consistent) reference to
1068<code>helper</code>. Really the most beneficial optimization here
1069may be the introduction of <code>myHelper</code> to eliminate a
1070second load, though a future compiler might do that automatically.</p>
1071
1072<p>Acquire/release ordering does not prevent stores from getting visibly
1073delayed, and does not ensure that stores become visible to other threads
1074in a consistent order. As a result, it does not support a tricky,
1075but fairly common coding pattern exemplified by Dekker's mutual exclusion
1076algorithm: All threads first set a flag indicating that they want to do
1077something; if a thread <i>t</i> then notices that no other thread is
1078trying to do something, it can safely proceed, knowing that there
1079will be no interference. No other thread will be
1080able to proceed, since <i>t</i>'s flag is still set. This fails
1081if the flag is accessed using acquire/release ordering, since that doesn't
1082prevent making a thread's flag visible to others late, after they have
1083erroneously proceeded. Default <code>memory_order_seq_cst</code>
1084does prevent it.</p>
1085
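<p>For illustration, the flag fragment of that pattern can be written with
default (sequentially consistent) C++ atomics as follows; the names are
illustrative, and the full algorithm must also handle the case in which
neither thread proceeds:</p>

<pre>
#include &lt;atomic&gt;

std::atomic&lt;bool&gt; flag1{false}, flag2{false};

void thread1() {
  flag1.store(true);        // default memory_order_seq_cst
  if (!flag2.load()) {      // cannot appear to precede the store above
    // critical-stuff: at most one of the two threads can get here
  }
}

void thread2() {
  flag2.store(true);
  if (!flag1.load()) {
    // critical-stuff
  }
}
</pre>

<p>With <code>memory_order_release</code> stores and
<code>memory_order_acquire</code> loads instead of the defaults, both
threads could read <code>false</code> and both could proceed.</p>
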
1086<h3 id="immutable">Immutable fields</h3>
1087
<p> If an object field is initialized on first use and then never changed,
it may be possible to initialize and subsequently read it using weakly
ordered accesses. In C++, it could be declared as <code>atomic</code>
and accessed using <code>memory_order_relaxed</code>; in Java, it
could be declared without <code>volatile</code> and accessed without
special measures. This requires that all of the following hold
(a sketch follows the list):</p>
1094
1095<ul>
1096<li>It should be possible to tell from the value of the field itself
whether it has already been initialized. The fast path that tests the
field and returns its value should read the field only once.
In Java the latter is essential. Even if the field tests as initialized,
1100a second load may read the earlier uninitialized value. In C++
1101the "read once" rule is merely good practice.
1102<li>Both initialization and subsequent loads must be atomic,
1103in that partial updates should not be visible. For Java, the field
1104should not be a <code>long</code> or <code>double</code>. For C++,
1105an atomic assignment is required; constructing it in place will not work, since
1106construction of an <code>atomic</code> is not atomic.
1107<li>Repeated initializations must be safe, since multiple threads
1108may read the uninitialized value concurrently. In C++, this generally
1109follows from the "trivially copyable" requirement imposed for all
atomic types; a type that owns pointers to other objects would need a
nontrivial copy constructor and destructor, and so would not be
trivially copyable. For Java, certain reference types are acceptable,
as described in the next item.
<li>Java references are limited to immutable types containing only final
1115fields. The constructor of the immutable type should not publish
1116a reference to the object. In this case the Java final field rules
1117ensure that if a reader sees the reference, it will also see the
1118initialized final fields. C++ has no analog to these rules and
1119pointers to owned objects are unacceptable for this reason as well (in
1120addition to violating the "trivially copyable" requirements).
1121</ul>
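
<p>As a C++ sketch of these rules, consider lazily caching an idempotently
computed value, here a hypothetical hash code, with zero reserved to mean
"not yet computed"; the class and the hash function are illustrative
assumptions:</p>

<pre>
#include &lt;atomic&gt;
#include &lt;cstdint&gt;
#include &lt;string&gt;

class Name {
 public:
  explicit Name(std::string s) : text_(std::move(s)) {}

  std::uint32_t hash() const {
    // Read the field only once; the value itself says whether it is initialized.
    std::uint32_t h = cachedHash_.load(std::memory_order_relaxed);
    if (h == 0) {
      h = computeHash(text_);
      if (h == 0) h = 1;   // reserve 0 to mean "not yet computed"
      // Racing initializations all store the same value, so losing the
      // race is harmless and relaxed ordering suffices.
      cachedHash_.store(h, std::memory_order_relaxed);
    }
    return h;
  }

 private:
  static std::uint32_t computeHash(const std::string&amp; s) {
    std::uint32_t h = 2166136261u;   // FNV-1a, as an arbitrary example
    for (unsigned char c : s) { h ^= c; h *= 16777619u; }
    return h;
  }

  std::string text_;                 // set in the constructor, never changed
  mutable std::atomic&lt;std::uint32_t&gt; cachedHash_{0};
};
</pre>

<p>The value itself records whether initialization has happened, the
accesses are atomic, and a racing re-initialization stores an identical
value, so all of the conditions above are satisfied.</p>
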
1123<h2 id="closing_notes">Closing Notes</h2>
1124
1125<p>While this document does more than merely scratch the surface, it doesn’t
1126manage more than a shallow gouge. This is a very broad and deep topic. Some
1127areas for further exploration:</p>
1128
1129<ul>
<li>The actual Java and C++ memory models are expressed in terms of a
<em>happens-before</em> relation that specifies when two actions are guaranteed
to occur in a certain order. When we defined a data race, we informally
talked about two memory accesses happening "simultaneously";
officially this is defined as neither one happening before the other.
Although that intuitive notion is generally good enough, it is
instructive to learn the actual definitions of <em>happens-before</em>
and <em>synchronizes-with</em> in the Java or C++ memory model,
particularly if you are contemplating using weakly ordered atomic
operations in C++.
(The current Java specification only defines <code>lazySet()</code>
very informally.)</li>
Dirk Doughertyd5894212012-11-28 18:53:10 -08001142<li>Explore what compilers are and aren’t allowed to do when reordering code.
1143(The JSR-133 spec has some great examples of legal transformations that lead to
1144unexpected results.)</li>
1145<li>Find out how to write immutable classes in Java and C++. (There’s more to
1146it than just “don’t change anything after construction”.)</li>
1147<li>Internalize the recommendations in the Concurrency section of <em>Effective
1148Java, 2nd Edition.</em> (For example, you should avoid calling methods that are
1149meant to be overridden while inside a synchronized block.)</li>
Dirk Doughertyd5894212012-11-28 18:53:10 -08001150<li>Read through the {@link java.util.concurrent} and {@link
1151java.util.concurrent.atomic} APIs to see what's available. Consider using
1152concurrency annotations like <code>@ThreadSafe</code> and
1153<code>@GuardedBy</code> (from net.jcip.annotations).</li>
1154</ul>
1155
1156<p>The <a href="#more">Further Reading</a> section in the appendix has links to
1157documents and web sites that will better illuminate these topics.</p>
1158
1159<h2 id="appendix">Appendix</h2>
1160
Dirk Doughertyd5894212012-11-28 18:53:10 -08001161<h3 id="sync_stores">Implementing synchronization stores</h3>
1162
1163<p><em>(This isn’t something most programmers will find themselves implementing,
1164but the discussion is illuminating.)</em></p>
1165
<p> For small built-in types like <code>int</code>, on hardware supported by
Android, ordinary load and store instructions ensure that a store
1168will be made visible either in its entirety, or not at all, to another
1169processor loading the same location. Thus some basic notion
1170of "atomicity" is provided for free.</p>
<p> As we saw before, this does not suffice. In order to ensure sequential
consistency we also need to prevent reordering of operations, and to ensure
that memory operations become visible to other processors in a consistent
order. It turns out that the latter is automatic on Android-supported
1176hardware, provided we make judicious choices for enforcing the former,
1177so we largely ignore it here.</p>
1178
<p> The order of memory operations is preserved both by preventing reordering
by the compiler and by preventing reordering by the hardware. Here we focus
on the latter.</p>
1182
1183<p> Memory ordering on current hardware is generally enforced with
1184"fence" instructions that
1185roughly prevent instructions following the fence from becoming visible
1186before instructions preceding the fence. (These are also commonly
1187called "barrier" instructions, but that risks confusion with
1188<code>pthread_barrier</code>-style barriers, which do much more
1189than this.) The precise meaning of
1190fence instructions is a fairly complicated topic that has to address
1191the way in which guarantees provided by multiple different kinds of fences
1192interact, and how these combine with other ordering guarantees usually
1193provided by the hardware. This is a high level overview, so we will
1194gloss over these details.</p>
1195
1196<p> The most basic kind of ordering guarantee is that provided by C++
1197<code>memory_order_acquire</code> and <code>memory_order_release</code>
atomic operations: Memory operations preceding a release store
should be visible following an acquire load that reads the stored value.
On ARM, this is
1200enforced by:</p>
1201
1202<ul>
1203<li>Preceding the store instruction with a suitable fence instruction.
1204This prevents all prior memory accesses from being reordered with the
1205store instruction. (It also unnecessarily prevents reordering with
later store instructions. ARMv8 provides an alternative that doesn't share
1207this problem.)
1208<li>Following the load instruction with a suitable fence instruction,
1209preventing the load from being reordered with subsequent accesses.
1210(And once again providing unneeded ordering with at least earlier loads.)
1211</ul>
1212
1213<p> Together these suffice for C++ acquire/release ordering.
1214They are necessary, but not sufficient, for Java <code>volatile</code>
1215or C++ sequentially consistent <code>atomic</code>.</p>
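
<p>To make this concrete, the following sketch shows C++ release/acquire
operations, with comments giving one plausible ARM instruction mapping;
the exact sequences are chosen by the compiler, so treat them as an
approximation rather than a guaranteed mapping:</p>

<pre>
#include &lt;atomic&gt;

std::atomic&lt;int&gt; a{0};

void releaseStore(int v) {
  // ARMv7, roughly:  dmb ish ; str
  // ARMv8, roughly:  stlr
  a.store(v, std::memory_order_release);
}

int acquireLoad() {
  // ARMv7, roughly:  ldr ; dmb ish
  // ARMv8, roughly:  ldar
  return a.load(std::memory_order_acquire);
}
</pre>

<p>The fence (or the ARMv8 load-acquire/store-release instruction) is what
orders earlier accesses before the store and later accesses after the load.</p>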
1216
1217<p>To see what else we need, consider the fragment of Dekker’s algorithm
1218we briefly mentioned earlier.
1219<code>flag1</code> and <code>flag2</code> are C++ <code>atomic</code>
1220or Java <code>volatile</code> variables, both initially false.</p>
1222<table>
1223<tr>
1224<th>Thread 1</th>
1225<th>Thread 2</th>
1226</tr>
1227<tr>
1228<td><code>flag1 = true<br />
1229if (flag2 == false)<br />
1230&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
1231<td><code>flag2 = true<br />
1232if (flag1 == false)<br />
1233&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
1234</tr>
Hans Boehm9e629ad2015-09-14 13:50:00 -07001235</table>
Hans Boehm9e629ad2015-09-14 13:50:00 -07001237<p>Sequential consistency implies that one of the assignments to
1238<code>flag</code><i>n</i> must be executed first, and be seen by the
1239test in the other thread. Thus, we will never see
Dirk Doughertyd5894212012-11-28 18:53:10 -08001240these threads executing the “critical-stuff” simultaneously.</p>
1241
<p>But the fencing required for acquire-release ordering only adds
fences at the beginning and end of each thread, which doesn't help
here. We additionally need to ensure that if a
<code>volatile</code>/<code>atomic</code> store is followed by
a <code>volatile</code>/<code>atomic</code> load, the two are not reordered.
This is normally enforced by adding a fence not just before a
sequentially consistent store, but also after it.
1249(This is again much stronger than required, since this fence typically orders
1250all earlier memory accesses with respect to all later ones. Again ARMv8
1251offers a more targeted solution.)</p>
1252
1253<p>We could instead associate the extra fence with sequentially
1254consistent loads. Since stores are less frequent, the convention
1255we described is more common and used on Android.</p>
1257<p>As we saw in an earlier section, we need to insert a store/load barrier
1258between the two operations. The code executed in the VM for a volatile access
1259will look something like this:</p>
1260
1261<table>
1262<tr>
1263<th>volatile load</th>
1264<th>volatile store</th>
1265</tr>
1266<tr>
1267<td><code>reg = A<br />
Hans Boehm9e629ad2015-09-14 13:50:00 -07001268<em>fence for "acquire" (1)</em></code></td>
1269<td><code><em>fence for "release" (2)</em><br />
Dirk Doughertyd5894212012-11-28 18:53:10 -08001270A = reg<br />
Hans Boehm9e629ad2015-09-14 13:50:00 -07001271<em>fence for later atomic load (3)</em></code></td>
Dirk Doughertyd5894212012-11-28 18:53:10 -08001272</tr>
1273</table>
1274
Hans Boehm9e629ad2015-09-14 13:50:00 -07001275<p>Real machine architectures commonly provide multiple types of
1276fences, which order different types of accesses and may have
1277different cost. The choice between these is subtle, and influenced
1278by the need to ensure that stores are made visible to other cores in
1279a consistent order, and that the memory ordering imposed by the
1280combination of multiple fences composes correctly. For more details,
1281please see the University of Cambridge page with <a href="#more">
1282collected mappings of atomics to actual processors</a>.</p>
Hans Boehm9e629ad2015-09-14 13:50:00 -07001284<p>On some architectures, notably x86, the "acquire" and "release"
1285barriers are unnecessary, since the hardware always implicitly
1286enforces sufficient ordering. Thus on x86 only the last fence (3)
1287is really generated. Similarly on x86, atomic read-modify-write
1288operations implicitly include a strong fence. Thus these never
1289require any fences. On ARM all fences we discussed above are
1290required.</p>
1292<h3 id="more">Further reading</h3>
1293
1294<p>Web pages and documents that provide greater depth or breadth. The more generally useful articles are nearer the top of the list.</p>
1295
1296<dl>
1297<dt>Shared Memory Consistency Models: A Tutorial</dt>
Hans Boehm9e629ad2015-09-14 13:50:00 -07001298<dd>Written in 1995 by Adve &amp; Gharachorloo, this is a good place to start if you want to dive more deeply into memory consistency models.
Dirk Doughertyd5894212012-11-28 18:53:10 -08001299<br /><a href="http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf">http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf</a></dd>
1300
1301<dt>Memory Barriers</dt>
1302<dd>Nice little article summarizing the issues.
1303<br /><a href="http://en.wikipedia.org/wiki/Memory_barrier">http://en.wikipedia.org/wiki/Memory_barrier</a></dd>
1304
1305<dt>Threads Basics</dt>
1306<dd>An introduction to multi-threaded programming in C++ and Java, by Hans Boehm. Excellent discussion of data races and basic synchronization methods.
<br /><a href="http://www.hboehm.info/c++mm/threadsintro.html">http://www.hboehm.info/c++mm/threadsintro.html</a></dd>
1309<dt>Java Concurrency In Practice</dt>
1310<dd>Published in 2006, this book covers a wide range of topics in great detail. Highly recommended for anyone writing multi-threaded code in Java.
1311<br /><a href="http://www.javaconcurrencyinpractice.com">http://www.javaconcurrencyinpractice.com</a></dd>
1312
1313<dt>JSR-133 (Java Memory Model) FAQ</dt>
1314<dd>A gentle introduction to the Java memory model, including an explanation of synchronization, volatile variables, and construction of final fields.
Hans Boehm9e629ad2015-09-14 13:50:00 -07001315(A bit dated, particularly when it discusses other languages.)
Dirk Doughertyd5894212012-11-28 18:53:10 -08001316<br /><a href="http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html">http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html</a></dd>
1317
Hans Boehm9e629ad2015-09-14 13:50:00 -07001318<dt>Validity of Program Transformations in the Java Memory Model</dt>
1319<dd>A rather technical explanation of remaining problems with the
1320Java memory model. These issues do not apply to data-race-free
1321programs.
1322<br /><a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.112.1790&type=pdf">http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.112.1790&type=pdf</a></dd>
1323
Dirk Doughertyd5894212012-11-28 18:53:10 -08001324<dt>Overview of package java.util.concurrent</dt>
1325<dd>The documentation for the <code>java.util.concurrent</code> package. Near the bottom of the page is a section entitled “Memory Consistency Properties” that explains the guarantees made by the various classes.
1326<br />{@link java.util.concurrent java.util.concurrent} Package Summary</dd>
1327
1328<dt>Java Theory and Practice: Safe Construction Techniques in Java</dt>
1329<dd>This article examines in detail the perils of references escaping during object construction, and provides guidelines for thread-safe constructors.
1330<br /><a href="http://www.ibm.com/developerworks/java/library/j-jtp0618.html">http://www.ibm.com/developerworks/java/library/j-jtp0618.html</a></dd>
1331
1332<dt>Java Theory and Practice: Managing Volatility</dt>
1333<dd>A nice article describing what you can and can’t accomplish with volatile fields in Java.
1334<br /><a href="http://www.ibm.com/developerworks/java/library/j-jtp06197.html">http://www.ibm.com/developerworks/java/library/j-jtp06197.html</a></dd>
1335
1336<dt>The “Double-Checked Locking is Broken” Declaration</dt>
Hans Boehm9e629ad2015-09-14 13:50:00 -07001337<dd>Bill Pugh’s detailed explanation of the various ways in which double-checked locking is broken without <code>volatile</code> or <code>atomic</code>.
1338Includes C/C++ and Java.
Dirk Doughertyd5894212012-11-28 18:53:10 -08001339<br /><a href="http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html">http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html</a></dd>
1340
1341<dt>[ARM] Barrier Litmus Tests and Cookbook</dt>
Hans Boehm9e629ad2015-09-14 13:50:00 -07001342<dd>A discussion of ARM SMP issues, illuminated with short snippets of ARM code. If you found the examples in this document too un-specific, or want to read the formal description of the DMB instruction, read this. Also describes the instructions used for memory barriers on executable code (possibly useful if you’re generating code on the fly). Note that this predates ARMv8, which also
1343supports an additional set of memory ordering instructions.
Dirk Doughertyd5894212012-11-28 18:53:10 -08001344<br /><a href="http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf">http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf</a></dd>
1345
<dt>Linux Kernel Memory Barriers</dt>
1347<dd>Documentation for Linux kernel memory barriers. Includes some useful examples and ASCII art.
1348<br/><a href="http://www.kernel.org/doc/Documentation/memory-barriers.txt">http://www.kernel.org/doc/Documentation/memory-barriers.txt</a></dd>
1349
Hans Boehm9e629ad2015-09-14 13:50:00 -07001350<dt>ISO/IEC JTC1 SC22 WG21 (C++ standards) 14882 (C++ programming language), section 1.10 and clause 29 (“Atomic operations library”)</dt>
1351<dd>Draft standard for C++ atomic operation features. This version is
1352close to the C++14 standard, which includes minor changes in this area
1353from C++11.
1354<br /><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4527.pdf">http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4527.pdf</a>
1355<br />(intro: <a href="http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf">http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf</a>)</dd>
1357<dt>ISO/IEC JTC1 SC22 WG14 (C standards) 9899 (C programming language) chapter 7.16 (“Atomics &lt;stdatomic.h&gt;”)</dt>
Hans Boehm9e629ad2015-09-14 13:50:00 -07001358<dd>Draft standard for ISO/IEC 9899-201x C atomic operation features.
1359For details, also check later defect reports.
1360<br /><a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf">http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf</a></dd>
1361
1362<dt>C/C++11 mappings to processors (University of Cambridge)</dt>
1363<dd>Jaroslav Sevcik and Peter Sewell's collection of translations
1364of C++ atomics to various common processor instruction sets.
1365<br /><a href="http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html">
1366http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html</a></dd>
1368<dt>Dekker’s algorithm</dt>
1369<dd>The “first known correct solution to the mutual exclusion problem in concurrent programming”. The wikipedia article has the full algorithm, with a discussion about how it would need to be updated to work with modern optimizing compilers and SMP hardware.
1370<br /><a href="http://en.wikipedia.org/wiki/Dekker's_algorithm">http://en.wikipedia.org/wiki/Dekker's_algorithm</a></dd>
1371
1372<dt>Comments on ARM vs. Alpha and address dependencies</dt>
1373<dd>An e-mail on the arm-kernel mailing list from Catalin Marinas. Includes a nice summary of address and control dependencies.
1374<br /><a href="http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-05/msg11811.html">http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-05/msg11811.html</a></dd>
1375
1376<dt>What Every Programmer Should Know About Memory</dt>
1377<dd>A very long and detailed article about different types of memory, particularly CPU caches, by Ulrich Drepper.
1378<br /><a href="http://www.akkadia.org/drepper/cpumemory.pdf">http://www.akkadia.org/drepper/cpumemory.pdf</a></dd>
1379
1380<dt>Reasoning about the ARM weakly consistent memory model</dt>
<dd>This paper was written by Chong &amp; Ishtiaq of ARM, Ltd. It attempts to describe the ARM SMP memory model in a rigorous but accessible fashion. The definition of “observability” used here comes from this paper.
1382<br /><a href="http://portal.acm.org/ft_gateway.cfm?id=1353528&type=pdf&coll=&dl=&CFID=96099715&CFTOKEN=57505711">http://portal.acm.org/ft_gateway.cfm?id=1353528&type=pdf&coll=&dl=&CFID=96099715&CFTOKEN=57505711</a></dd>
1383
1384<dt>The JSR-133 Cookbook for Compiler Writers</dt>
Hans Boehm9e629ad2015-09-14 13:50:00 -07001385<dd>Doug Lea wrote this as a companion to the JSR-133 (Java Memory Model) documentation. It contains the initial set of implementation guidelines
1386for the Java memory model that was used by many compiler writers, and is
1387still widely cited and likely to provide insight.
1388Unfortunately, the four fence varieties discussed here are not a good
1389match for Android-supported architectures, and the above C++11 mappings
1390are now a better source of precise recipes, even for Java.
Dirk Doughertyd5894212012-11-28 18:53:10 -08001391<br /><a href="http://g.oswego.edu/dl/jmm/cookbook.html">http://g.oswego.edu/dl/jmm/cookbook.html</a></dd>
1392
Hans Boehm9e629ad2015-09-14 13:50:00 -07001393<dt>x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors</dt>
1394<dd>A precise description of the x86 memory model. Precise descriptions of
1395the ARM memory model are unfortunately significantly more complicated.
1396<br /><a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf">http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf</a></dd>
Dirk Doughertyd5894212012-11-28 18:53:10 -08001397</dl>