page.title=SMP Primer for Android
page.tags=ndk,native

page.article=true
@jd:body

<div id="tb-wrapper">
<div id="tb">
<h2>In this document</h2>
<ol class="nolist">
  <li><a href="#theory">Theory</a>
    <ol class="nolist">
    <li style="margin: 3px 0 0"><a href="#mem_consistency">Memory consistency models</a>
      <ol class="nolist">
      <li style="margin:0"><a href="#proc_consistency">Processor consistency</a></li>
      <li style="margin:0"><a href="#cpu_cache">CPU cache behavior</a></li>
      <li style="margin:0"><a href="#observability">Observability</a></li>
      <li style="margin:0"><a href="#ordering">ARM’s weak ordering</a></li>
      </ol>
    </li>
    <li style="margin:3px 0 0"><a href="#datamem_barriers">Data memory barriers</a>
      <ol class="nolist">
      <li style="margin:0"><a href="#ss_ll">Store/store and load/load</a></li>
      <li style="margin:0"><a href="#ls_sl">Load/store and store/load</a></li>
      <li style="margin:0"><a href="#barrier_inst">Barrier instructions</a></li>
      <li style="margin:0"><a href="#addr_dep">Address dependencies and causal consistency</a></li>
      <li style="margin:0"><a href="#membarrier_summary">Memory barrier summary</a></li>
      </ol>
    </li>
    <li style="margin:3px 0 0"><a href="#atomic_ops">Atomic operations</a>
      <ol class="nolist">
      <li style="margin:0"><a href="#atomic_essentials">Atomic essentials</a></li>
      <li style="margin:0"><a href="#atomic_barrierpairing">Atomic + barrier pairing</a></li>
      <li style="margin:0"><a href="#acq_rel">Acquire and release</a></li>
      </ol>
    </li>
    </ol>
  </li>
  <li><a href="#practice">Practice</a>
    <ol class="nolist">
    <li style="margin:3px 0 0"><a href="#c_dont">What not to do in C</a>
      <ol class="nolist">
      <li style="margin:0"><a href="#volatile">C/C++ and “volatile”</a></li>
      <li style="margin:0"><a href="#examplesc">Examples</a></li>
      </ol>
    </li>
    <li style="margin:3px 0 0"><a href="#j_dont">What not to do in Java</a>
      <ol class="nolist">
      <li style="margin:0"><a href="#sync_volatile">“synchronized” and “volatile”</a></li>
      <li style="margin:0"><a href="#examplesj">Examples</a></li>
      </ol>
    </li>
    <li style="margin:3px 0 0"><a href="#bestpractice">What to do</a>
      <ol class="nolist">
      <li style="margin:0"><a href="#advice">General advice</a></li>
      <li style="margin:0"><a href="#sync_guarantees">Synchronization primitive guarantees</a></li>
      <li style="margin:0"><a href="#ccpp_changes">Upcoming changes to C/C++</a></li>
      </ol>
    </li>
    </ol>
  </li>
  <li><a href="#closing_notes">Closing Notes</a></li>
  <li><a href="#appendix">Appendix</a>
    <ol class="nolist">
    <li style="margin:0"><a href="#smp_failure_example">SMP failure example</a></li>
    <li style="margin:0"><a href="#sync_stores">Implementing synchronization stores</a></li>
    <li style="margin:0"><a href="#more">Further reading</a></li>
    </ol>
  </li>
</ol>
</div>
</div>

<p>Android 3.0 and later platform versions are optimized to support
multiprocessor architectures. This document introduces issues that
can arise when writing code for symmetric multiprocessor systems in C, C++, and the Java
programming language (hereafter referred to simply as “Java” for the sake of
brevity). It's intended as a primer for Android app developers, not as a complete
discussion on the subject. The focus is on the ARM CPU architecture.</p>

<p>If you’re in a hurry, you can skip the <a href="#theory">Theory</a> section
and go directly to <a href="#practice">Practice</a> for best practices, but this
is not recommended.</p>


<h2 id="intro">Introduction</h2>

<p>SMP is an acronym for “Symmetric Multi-Processor”. It describes a design in
which two or more identical CPU cores share access to main memory. Until
a few years ago, all Android devices were UP (Uni-Processor).</p>

<p>Most &mdash; if not all &mdash; Android devices do have multiple CPUs, but generally one
of them is used to run applications while others manage various bits of device
hardware (for example, the radio). The CPUs may have different architectures, and the
programs running on them can’t use main memory to communicate with each
other.</p>

<p>Most Android devices sold today are built around SMP designs,
making things a bit more complicated for software developers. The sorts of race
conditions you might encounter in a multi-threaded program are much worse on SMP
when two or more of your threads are running simultaneously on different cores.
What’s more, SMP on ARM is more challenging to work with than SMP on x86. Code
that has been thoroughly tested on x86 may break badly on ARM.</p>

<p>The rest of this document will explain why, and tell you what you need to do
to ensure that your code behaves correctly.</p>


<h2 id="theory">Theory</h2>

<p>This is a high-speed, glossy overview of a complex subject. Some areas will
be incomplete, but none of it should be misleading or wrong.</p>

<p>See <a href="#more">Further reading</a> at the end of the document for
pointers to more thorough treatments of the subject.</p>

<h3 id="mem_consistency">Memory consistency models</h3>

<p>Memory consistency models, or often just “memory models”, describe the
guarantees the hardware architecture makes about memory accesses. For example,
if you write a value to address A, and then write a value to address B, the
model might guarantee that every CPU core sees those writes happen in that
order.</p>

<p>The model most programmers are accustomed to is <em>sequential
consistency</em>, which is described like this <span
style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Adve &
Gharachorloo</a>)</span>:</p>

<ul>
<li>All memory operations appear to execute one at a time.</li>
<li>All operations on a single processor appear to execute in the order described
by that processor's program.</li>
</ul>

<p>If you look at a bit of code and see that it does some reads and writes from
memory, on a sequentially-consistent CPU architecture you know that the code
will do those reads and writes in the expected order. It’s possible that the
CPU is actually reordering instructions and delaying reads and writes, but there
is no way for code running on the device to tell that the CPU is doing anything
other than execute instructions in a straightforward manner. (We’re ignoring
memory-mapped device driver I/O for the moment.)</p>

<p>To illustrate these points it’s useful to consider small snippets of code,
commonly referred to as <em>litmus tests</em>. These are assumed to execute in
<em>program order</em>, that is, the order in which the instructions appear here is
the order in which the CPU will execute them. We don’t want to consider
instruction reordering performed by compilers just yet.</p>

<p>Here’s a simple example, with code running on two threads:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>A = 3<br />
B = 5</code></td>
<td><code>reg0 = B<br />
reg1 = A</code></td>
</tr>
</table>

<p>In this and all future litmus examples, memory locations are represented by
capital letters (A, B, C) and CPU registers start with “reg”. All memory is
initially zero. Instructions are executed from top to bottom. Here, thread 1
stores the value 3 at location A, and then the value 5 at location B. Thread 2
loads the value from location B into reg0, and then loads the value from
location A into reg1. (Note that we’re writing in one order and reading in
another.)</p>

<p>Thread 1 and thread 2 are assumed to execute on different CPU cores. You
should <strong>always</strong> make this assumption when thinking about
multi-threaded code.</p>

<p>Sequential consistency guarantees that, after both threads have finished
executing, the registers will be in one of the following states:</p>


<table>
<tr>
<th>Registers</th>
<th>States</th>
</tr>
<tr>
<td>reg0=5, reg1=3</td>
<td>possible (thread 1 ran first)</td>
</tr>
<tr>
<td>reg0=0, reg1=0</td>
<td>possible (thread 2 ran first)</td>
</tr>
<tr>
<td>reg0=0, reg1=3</td>
<td>possible (concurrent execution)</td>
</tr>
<tr>
<td>reg0=5, reg1=0</td>
<td>never</td>
</tr>
</table>

<p>To get into a situation where we see B=5 before we see the store to A, either
the reads or the writes would have to happen out of order. On a
sequentially-consistent machine, that can’t happen.</p>

<p>Most uni-processors, including x86 and ARM, are sequentially consistent.
Most SMP systems, including x86 and ARM, are not.</p>

211<h4 id="proc_consistency">Processor consistency</h4>
212
213<p>x86 SMP provides <em>processor consistency</em>, which is slightly weaker than
214sequential. While the architecture guarantees that loads are not reordered with
215respect to other loads, and stores are not reordered with respect to other
216stores, it does not guarantee that a store followed by a load will be observed
217in the expected order.</p>
218
219<p>Consider the following example, which is a piece of Dekker’s Algorithm for
220mutual exclusion:</p>
221
222<table>
223<tr>
224<th>Thread 1</th>
225<th>Thread 2</th>
226</tr>
227<tr>
228<td><code>A = true<br />
229reg1 = B<br />
230if (reg1 == false)<br />
231&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
232<td><code>B = true<br />
233reg2 = A<br />
234if (reg2 == false)<br />
235&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
236</tr>
237</table>
238
239<p>The idea is that thread 1 uses A to indicate that it’s busy, and thread 2
240uses B. Thread 1 sets A and then checks to see if B is set; if not, it can
241safely assume that it has exclusive access to the critical section. Thread 2
242does something similar. (If a thread discovers that both A and B are set, a
243turn-taking algorithm is used to ensure fairness.)</p>
244
245<p>On a sequentially-consistent machine, this works correctly. On x86 and ARM
246SMP, the store to A and the load from B in thread 1 can be “observed” in a
247different order by thread 2. If that happened, we could actually appear to
248execute this sequence (where blank lines have been inserted to highlight the
249apparent order of operations):</p>
250
251<table>
252<tr>
253<th>Thread 1</th>
254<th>Thread 2</th>
255</tr>
256<tr>
257<td><code>reg1 = B<br />
258<br />
259<br />
260A = true<br />
261if (reg1 == false)<br />
262&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
263
264<td><code><br />
265B = true<br />
266reg2 = A<br />
267<br />
268if (reg2 == false)<br />
269&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
270</tr>
271</table>
272
273<p>This results in both reg1 and reg2 set to “false”, allowing the threads to
274execute code in the critical section simultaneously. To understand how this can
275happen, it’s useful to know a little about CPU caches.</p>
276
277<h4 id="cpu_cache">CPU cache behavior</h4>
278
279<p>This is a substantial topic in and of itself. An extremely brief overview
280follows. (The motivation for this material is to provide some basis for
281understanding why SMP systems behave as they do.)</p>
282
283<p>Modern CPUs have one or more caches between the processor and main memory.
284These are labeled L1, L2, and so on, with the higher numbers being successively
285“farther” from the CPU. Cache memory adds size and cost to the hardware, and
286increases power consumption, so the ARM CPUs used in Android devices typically
287have small L1 caches and little or no L2/L3.</p>
288
289<p>Loading or storing a value into the L1 cache is very fast. Doing the same to
290main memory can be 10-100x slower. The CPU will therefore try to operate out of
291the cache as much as possible. The <em>write policy</em> of a cache determines when data
292written to it is forwarded to main memory. A <em>write-through</em> cache will initiate
293a write to memory immediately, while a <em>write-back</em> cache will wait until it runs
294out of space and has to evict some entries. In either case, the CPU will
295continue executing instructions past the one that did the store, possibly
296executing dozens of them before the write is visible in main memory. (While the
297write-through cache has a policy of immediately forwarding the data to main
298memory, it only <strong>initiates</strong> the write. It does not have to wait
299for it to finish.)</p>
300
301<p>The cache behavior becomes relevant to this discussion when each CPU core has
302its own private cache. In a simple model, the caches have no way to interact
303with each other directly. The values held by core #1’s cache are not shared
304with or visible to core #2’s cache except as loads or stores from main memory.
305The long latencies on memory accesses would make inter-thread interactions
306sluggish, so it’s useful to define a way for the caches to share data. This
307sharing is called <em>cache coherency</em>, and the coherency rules are defined
308by the CPU architecture’s <em>cache consistency model</em>.</p>
309
310<p>With that in mind, let’s return to the Dekker example. When core 1 executes
311“A = 1”, the value gets stored in core 1’s cache. When core 2 executes “if (A
312== 0)”, it might read from main memory or it might read from core 2’s cache;
313either way it won’t see the store performed by core 1. (“A” could be in core
3142’s cache because of a previous load from “A”.)</p>
315
316<p>For the memory consistency model to be sequentially consistent, core 1 would
317have to wait for all other cores to be aware of “A = 1” before it could execute
318“if (B == 0)” (either through strict cache coherency rules, or by disabling the
319caches entirely so everything operates out of main memory). This would impose a
320performance penalty on every store operation. Relaxing the rules for the
321ordering of stores followed by loads improves performance but imposes a burden
322on software developers.</p>
323
324<p>The other guarantees made by the processor consistency model are less
325expensive to make. For example, to ensure that memory writes are not observed
326out of order, it just needs to ensure that the stores are published to other
327cores in the same order that they were issued. It doesn’t need to wait for
328store #1 to <strong>finish</strong> being published before it can start on store
329#2, it just needs to ensure that it doesn’t finish publishing #2 before it
330finishes publishing #1. This avoids a performance bubble.</p>
331
332<p>Relaxing the guarantees even further can provide additional opportunities for
333CPU optimization, but creates more opportunities for code to behave in ways the
334programmer didn’t expect.</p>
335
336<p>One additional note: CPU caches don’t operate on individual bytes. Data is
337read or written as <em>cache lines</em>; for many ARM CPUs these are 32 bytes. If you
338read data from a location in main memory, you will also be reading some adjacent
339values. Writing data will cause the cache line to be read from memory and
340updated. As a result, you can cause a value to be loaded into cache as a
341side-effect of reading or writing something nearby, adding to the general aura
342of mystery.</p>
343
344<h4 id="observability">Observability</h4>
345
346<p>Before going further, it’s useful to define in a more rigorous fashion what
347is meant by “observing” a load or store. Suppose core 1 executes “A = 1”. The
348store is <em>initiated</em> when the CPU executes the instruction. At some
349point later, possibly through cache coherence activity, the store is
350<em>observed</em> by core 2. In a write-through cache it doesn’t really
351<em>complete</em> until the store arrives in main memory, but the memory
352consistency model doesn’t dictate when something completes, just when it can be
353<em>observed</em>.</p>
354
355
356<p>(In a kernel device driver that accesses memory-mapped I/O locations, it may
357be very important to know when things actually complete. We’re not going to go
358into that here.)</p>
359
360<p>Observability may be defined as follows:</p>
361
362<ul>
363<li>"A write to a location in memory is said to be observed by an observer Pn
364when a subsequent read of the location by Pn would return the value written by
365the write."</li>
366<li>"A read of a location in memory is said to be observed by an observer Pm
367when a subsequent write to the location by Pm would have no effect on the value
368returned by the read." <span style="font-size:.9em;color:#777">(<em><a
369href="#more" style="color:#777">Reasoning about the ARM weakly consistent memory
370model</a></em>)</span></li>
371</ul>
372
373
374<p>A less formal way to describe it (where “you” and “I” are CPU cores) would be:</p>
375
376<ul>
377<li>I have observed your write when I can read what you wrote</li>
378<li>I have observed your read when I can no longer affect the value you read</li>
379</ul>
380
381<p>The notion of observing a write is intuitive; observing a read is a bit less
382so (don’t worry, it grows on you).</p>
383
384<p>With this in mind, we’re ready to talk about ARM.</p>
385
386<h4 id="ordering">ARM's weak ordering</h4>
387
388<p>ARM SMP provides weak memory consistency guarantees. It does not guarantee that
389loads or stores are ordered with respect to each other.</p>
390
391<table>
392<tr>
393<th>Thread 1</th>
394<th>Thread 2</th>
395</tr>
396<tr>
397<td><code>A = 41<br />
398B = 1 // “A is ready”</code></td>
399<td><code>loop_until (B == 1)<br />
400reg = A</code></td>
401</tr>
402</table>
403
404<p>Recall that all addresses are initially zero. The “loop_until” instruction
405reads B repeatedly, looping until we read 1 from B. The idea here is that
406thread 2 is waiting for thread 1 to update A. Thread 1 sets A, and then sets B
407to 1 to indicate data availability.</p>
408
409<p>On x86 SMP, this is guaranteed to work. Thread 2 will observe the stores
410made by thread 1 in program order, and thread 1 will observe thread 2’s loads in
411program order.</p>
412
413<p>On ARM SMP, the loads and stores can be observed in any order. It is
414possible, after all the code has executed, for reg to hold 0. It’s also
415possible for it to hold 41. Unless you explicitly define the ordering, you
416don’t know how this will come out.</p>
417
418<p>(For those with experience on other systems, ARM’s memory model is equivalent
419to PowerPC in most respects.)</p>
420
421
422<h3 id="datamem_barriers">Data memory barriers</h3>
423
424<p>Memory barriers provide a way for your code to tell the CPU that memory
425access ordering matters. ARM/x86 uniprocessors offer sequential consistency,
426and thus have no need for them. (The barrier instructions can be executed but
427aren’t useful; in at least one case they’re hideously expensive, motivating
428separate builds for SMP targets.)</p>
429
430<p>There are four basic situations to consider:</p>
431
432<ol>
433<li>store followed by another store</li>
434<li>load followed by another load</li>
435<li>load followed by store</li>
436<li>store followed by load</li>
437</ol>
438
439<h4 id="ss_ll">Store/store and load/load</h4>
440
441<p>Recall our earlier example:</p>
442
443<table>
444<tr>
445<th>Thread 1</th>
446<th>Thread 2</th>
447</tr>
448<tr>
449<td><code>A = 41<br />
450B = 1 // “A is ready”</code></td>
451<td><code>loop_until (B == 1)<br />
452reg = A</code></td>
453</tr>
454</table>
455
456
457<p>Thread 1 needs to ensure that the store to A happens before the store to B.
458This is a “store/store” situation. Similarly, thread 2 needs to ensure that the
459load of B happens before the load of A; this is a load/load situation. As
460mentioned earlier, the loads and stores can be observed in any order.</p>
461
462<div style="padding:.5em 2em;">
463<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
464<p>Going back to the cache discussion, assume A and B are on separate cache
465lines, with minimal cache coherency. If the store to A stays local but the
466store to B is published, core 2 will see B=1 but won’t see the update to A. On
467the other side, assume we read A earlier, or it lives on the same cache line as
468something else we recently read. Core 2 spins until it sees the update to B,
469then loads A from its local cache, where the value is still zero.</p>
470</div>
471</div>
472
473<p>We can fix it like this:</p>
474
475<table>
476<tr>
477<th>Thread 1</th>
478<th>Thread 2</th>
479</tr>
480<tr>
481<td><code>A = 41<br />
482<em>store/store barrier</em><br />
483B = 1 // “A is ready”</code></td>
484<td><code>loop_until (B == 1)<br />
485<em>load/load barrier</em><br />
486reg = A</code></td>
487</tr>
488</table>
489
490<p>The store/store barrier guarantees that <strong>all observers</strong> will
491observe the write to A before they observe the write to B. It makes no
492guarantees about the ordering of loads in thread 1, but we don’t have any of
493those, so that’s okay. The load/load barrier in thread 2 makes a similar
494guarantee for the loads there.</p>
495
496<p>Since the store/store barrier guarantees that thread 2 observes the stores in
497program order, why do we need the load/load barrier in thread 2? Because we
498also need to guarantee that thread 1 observes the loads in program order.</p>
499
500<div style="padding:.5em 2em;">
501<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
502<p>The store/store barrier could work by flushing all
503dirty entries out of the local cache, ensuring that other cores see them before
504they see any future stores. The load/load barrier could purge the local cache
505completely and wait for any “in-flight” loads to finish, ensuring that future
506loads are observed after previous loads. What the CPU actually does doesn’t
507matter, so long as the appropriate guarantees are kept. If we use a barrier in
508core 1 but not in core 2, core 2 could still be reading A from its local
509cache.</p>
510</div>
511</div>
512
513<p>Because the architectures have different memory models, these barriers are
514required on ARM SMP but not x86 SMP.</p>
515
516<h4 id="ls_sl">Load/store and store/load</h4>
517
518<p>The Dekker’s Algorithm fragment shown earlier illustrated the need for a
519store/load barrier. Here’s an example where a load/store barrier is
520required:</p>
521
522<table>
523<tr>
524<th>Thread 1</th>
525<th>Thread 2</th>
526</tr>
527<tr>
528<td><code>reg = A<br />
529B = 1 // “I have latched A”</code></td>
530<td><code>loop_until (B == 1)<br />
531A = 41 // update A</code></td>
532</tr>
533</table>
534
535<p>Thread 2 could observe thread 1’s store of B=1 before it observe’s thread 1’s
536load from A, and as a result store A=41 before thread 1 has a chance to read A.
537Inserting a load/store barrier in each thread solves the problem:</p>
538
539<table>
540<tr>
541<th>Thread 1</th>
542<th>Thread 2</th>
543</tr>
544<tr>
545<td><code>reg = A<br />
546<em>load/store barrier</em><br />
547B = 1 // “I have latched A”</code></td>
548<td><code>loop_until (B == 1)<br />
549<em>load/store barrier</em><br />
550A = 41 // update A</code></td>
551</tr>
552</table>
553
554<div style="padding:.5em 2em;">
555<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
556<p>A store to local cache may be observed before a load from main memory,
557because accesses to main memory are so much slower. In this case, assume core
5581’s cache has the cache line for B but not A. The load from A is initiated, and
559while that’s in progress execution continues. The store to B happens in local
560cache, and by some means becomes available to core 2 while the load from A is
561still in progress. Thread 2 is able to exit the loop before it has observed
562thread 1’s load from A.</p>
563
564<p>A thornier question is: do we need a barrier in thread 2? If the CPU doesn’t
565perform speculative writes, and doesn’t execute instructions out of order, can
566thread 2 store to A before thread 1’s read if thread 1 guarantees the load/store
567ordering? (Answer: no.) What if there’s a third core watching A and B?
568(Answer: now you need one, or you could observe B==0 / A==41 on the third core.)
569 It’s safest to insert barriers in both places and not worry about the
570details.</p>
571</div>
572</div>
573
574<p>As mentioned earlier, store/load barriers are the only kind required on x86
575SMP.</p>
576
577<h4 id="barrier_inst">Barrier instructions</h4>
578
579<p>Different CPUs provide different flavors of barrier instruction. For
580example:</p>
581
582<ul>
583<li>Sparc V8 has a “membar” instruction that takes a 4-element bit vector. The
584four categories of barrier can be specified individually.</li>
585<li>Alpha provides “rmb” (load/load), “wmb” (store/store), and “mb” (full).
586(Trivia: the linux kernel provides three memory barrier functions with these
587names and behaviors.)</li>
588<li>x86 has a variety of options; “mfence” (introduced with SSE2) provides a
589full barrier.</li>
590<li>ARMv7 has “dmb st” (store/store) and “dmb sy” (full).</li>
591</ul>
592
593<p>“Full barrier” means all four categories are included.</p>
594
595<p>It is important to recognize that the only thing guaranteed by barrier
596instructions is ordering. Do not treat them as cache coherency “sync points” or
597synchronous “flush” instructions. The ARM “dmb” instruction has no direct
598effect on other cores. This is important to understand when trying to figure
599out where barrier instructions need to be issued.</p>
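
<p>As a rough illustration, here is how code typically reaches these
instructions from C. This is a hedged sketch assuming a GCC-style ARMv7
toolchain; the macro names are invented for this example, and the
<code>"memory"</code> clobber doubles as the compiler reorder barrier discussed
later in this document:</p>

<pre>/* hypothetical helpers, assuming GCC-style inline assembly on ARMv7 */
#define store_store_barrier() __asm__ __volatile__("dmb st" ::: "memory")
#define full_barrier()        __asm__ __volatile__("dmb sy" ::: "memory")</pre>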


<h4 id="addr_dep">Address dependencies and causal consistency</h4>

<p><em>(This is a slightly more advanced topic and can be skipped.)</em></p>

<p>The ARM CPU provides one special case where a load/load barrier can be
avoided. Consider the following example from earlier, modified slightly:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>[A+8] = 41<br />
<em>store/store barrier</em><br />
B = 1 // “A is ready”</code></td>
<td><code>loop:<br />
&nbsp;&nbsp;&nbsp;&nbsp;reg0 = B<br />
&nbsp;&nbsp;&nbsp;&nbsp;if (reg0 == 0) goto loop<br />
reg1 = 8<br />
reg2 = [A + reg1]</code></td>
</tr>
</table>

<p>This introduces a new notation. If “A” refers to a memory address, “A+n”
refers to a memory address offset by n bytes from A. If A is the base address
of an object or array, [A+8] could be a field in the object or an element in the
array.</p>

631<p>The “loop_until” seen in previous examples has been expanded to show the load
632of B into reg0. reg1 is assigned the numeric value 8, and reg2 is loaded from
Andy McFadden9b6687b2013-01-15 12:26:04 -0800633the address [A+reg1] (the same location that thread 1 is accessing).</p>
Dirk Doughertyd5894212012-11-28 18:53:10 -0800634
635<p>This will not behave correctly because the load from B could be observed
636after the load from [A+reg1]. We can fix this with a load/load barrier after
637the loop, but on ARM we can also just do this:</p>
638
639<table>
640<tr>
641<th>Thread 1</th>
642<th>Thread 2</th>
643</tr>
644<tr>
Andy McFadden9b6687b2013-01-15 12:26:04 -0800645<td><code>[A+8] = 41<br />
Dirk Doughertyd5894212012-11-28 18:53:10 -0800646<em>store/store barrier</em><br />
647B = 1 // “A is ready”</code></td>
648<td><code>loop:<br />
649&nbsp;&nbsp;&nbsp;&nbsp;reg0 = B<br />
650&nbsp;&nbsp;&nbsp;&nbsp;if (reg0 == 0) goto loop<br />
651reg1 = 8 <strong>+ (reg0 & 0)</strong><br />
652reg2 = [A + reg1]</code></td>
653</tr>
654</table>
655
656<p>What we’ve done here is change the assignment of reg1 from a constant (8) to
657a value that depends on what we loaded from B. In this case, we do a bitwise
658AND of the value with 0, which yields zero, which means reg1 still has the value
6598. However, the ARM CPU believes that the load from [A+reg1] depends upon the
660load from B, and will ensure that the two are observed in program order.</p>
661
662<p>This is called an <em>address dependency</em>. Address dependencies exist
663when the value returned by a load is used to compute the address of a subsequent
664load or store. It can let you avoid the need for an explicit barrier in certain
665situations.</p>
666
667<p>ARM does not provide <em>control dependency</em> guarantees. To illustrate
668this it’s necessary to dip into ARM code for a moment: <span
669style="font-size:.9em;color:#777">(<em><a href="#more"
670style="color:#777">Barrier Litmus Tests and Cookbook</a></em>)</span>.</p>
671
672<pre>
673LDR r1, [r0]
674CMP r1, #55
675LDRNE r2, [r3]
676</pre>
677
678<p>The loads from r0 and r3 may be observed out of order, even though the load
679from r3 will not execute at all if [r0] doesn’t hold 55. Inserting AND r1, r1,
680#0 and replacing the last instruction with LDRNE r2, [r3, r1] would ensure
681proper ordering without an explicit barrier. (This is a prime example of why
682you can’t think about consistency issues in terms of instruction execution.
683Always think in terms of memory accesses.)</p>
684
685<p>While we’re hip-deep, it’s worth noting that ARM does not provide <em>causal
686consistency</em>:</p>
687
688<table>
689<tr>
690<th>Thread 1</th>
691<th>Thread 2</th>
692<th>Thread 3</th>
693</tr>
694<tr>
695<td><code>A = 1</code></td>
696<td><code>loop_until (A == 1)<br />
697B = 1</code></td>
698<td><code>loop:<br />
699&nbsp;&nbsp;reg0 = B<br />
700&nbsp;&nbsp;if (reg0 == 0) goto loop<br />
701reg1 = reg0 & 0<br />
702reg2 = [A+reg1]</code></td>
703</tr>
704</table>
705
706<p>Here, thread 1 sets A, signaling thread 2. Thread 2 sees that and sets B to
707signal thread 3. Thread 3 sees it and loads from A, using an address dependency
708to ensure that the load of B and the load of A are observed in program
709order.</p>
710
711<p>It’s possible for reg2 to hold zero at the end of this. The fact that a
712store in thread 1 causes something to happen in thread 2 which causes something
713to happen in thread 3 does not mean that thread 3 will observe the stores in
714that order. (Inserting a load/store barrier in thread 2 fixes this.)</p>
715
716<h4 id="membarrier_summary">Memory barrier summary</h4>
717
718<p>Barriers come in different flavors for different situations. While there can
719be performance advantages to using exactly the right barrier type, there are
720code maintenance risks in doing so &mdash; unless the person updating the code
721fully understands it, they might introduce the wrong type of operation and cause
722a mysterious breakage. Because of this, and because ARM doesn’t provide a wide
723variety of barrier choices, many atomic primitives use full
724barrier instructions when a barrier is required.</p>
725
726<p>The key thing to remember about barriers is that they define ordering. Don’t
727think of them as a “flush” call that causes a series of actions to happen.
728Instead, think of them as a dividing line in time for operations on the current
729CPU core.</p>
730
731
732<h3 id="atomic_ops">Atomic operations</h3>
733
734<p>Atomic operations guarantee that an operation that requires a series of steps
735always behaves as if it were a single operation. For example, consider a
736non-atomic increment (“++A”) executed on the same variable by two threads
737simultaneously:</p>
738
739<table>
740<tr>
741<th>Thread 1</th>
742<th>Thread 2</th>
743</tr>
744<tr>
745<td><code>reg = A<br />
746reg = reg + 1<br />
747A = reg</code></td>
748<td><code>reg = A<br />
749reg = reg + 1<br />
750A = reg</code></td>
751</tr>
752</table>
753
754<p>If the threads execute concurrently from top to bottom, both threads will
755load 0 from A, increment it to 1, and store it back, leaving a final result of
7561. If we used an atomic increment operation, you would be guaranteed that the
757final result will be 2.</p>
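
<p>As a sketch of what using such an operation looks like in C, assuming a
GCC-compatible toolchain (whose <code>__sync</code> builtins also act as full
barriers), the increment becomes a single call:</p>

<pre>static int gCounter = 0;

void incr(void)
{
    /* atomically performs { reg = gCounter; reg = reg + 1; gCounter = reg; }
       as one indivisible read-modify-write step */
    __sync_fetch_and_add(&gCounter, 1);
}</pre>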

<h4 id="atomic_essentials">Atomic essentials</h4>

<p>The most fundamental operations &mdash; loading and storing 32-bit values
&mdash; are inherently atomic on ARM so long as the data is aligned on a 32-bit
boundary. For example:</p>

<table>
<tr>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
<tr>
<td><code>reg = 0x00000000<br />
A = reg</code></td>
<td><code>reg = 0xffffffff<br />
A = reg</code></td>
</tr>
</table>

<p>The CPU guarantees that A will hold 0x00000000 or 0xffffffff. It will never
hold 0x0000ffff or any other partial “mix” of bytes.</p>

<div style="padding:.5em 2em;">
<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
<p>The atomicity guarantee is lost if the data isn’t aligned. Misaligned data
could straddle a cache line, so other cores could see the halves update
independently. Consequently, the ARMv7 documentation declares that it provides
“single-copy atomicity” for all byte accesses, halfword accesses to
halfword-aligned locations, and word accesses to word-aligned locations.
Doubleword (64-bit) accesses are <strong>not</strong> atomic, unless the
location is doubleword-aligned and special load/store instructions are used.
This behavior is important to understand when multiple threads are performing
unsynchronized updates to packed structures or arrays of primitive types.</p>
</div>
</div>

<p>There is no need for 32-bit “atomic read” or “atomic write” functions on ARM
or x86. Where one is provided for completeness, it just does a trivial load or
store.</p>

<p>Operations that perform more complex actions on data in memory are
collectively known as <em>read-modify-write</em> (RMW) instructions, because
they load data, modify it in some way, and write it back. CPUs vary widely in
how these are implemented. ARM uses a technique called “Load Linked / Store
Conditional”, or LL/SC.</p>

<div style="padding:.5em 2em;">
<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
<p>A <em>linked</em> or <em>locked</em> load reads the data from memory as
usual, but also establishes a reservation, tagging the physical memory address.
The reservation is cleared when another core tries to write to that address. To
perform an LL/SC, the data is read with a reservation, modified, and then a
conditional store instruction is used to try to write the data back. If the
reservation is still in place, the store succeeds; if not, the store will fail.
Atomic functions based on LL/SC usually loop, retrying the entire
read-modify-write sequence until it completes without interruption.</p>
</div>
</div>

<p>It’s worth noting that the read-modify-write operations would not work
correctly if they operated on stale data. If two cores perform an atomic
increment on the same address, and one of them is not able to see what the other
did because each core is reading and writing from local cache, the operation
won’t actually be atomic. The CPU’s cache coherency rules ensure that the
atomic RMW operations remain atomic in an SMP environment.</p>

<p>This should not be construed to mean that atomic RMW operations use a memory
barrier. On ARM, atomics have no memory barrier semantics. While a series of
atomic RMW operations on a single address will be observed in program order by
other cores, there are no guarantees when it comes to the ordering of atomic and
non-atomic operations.</p>

<p>It often makes sense to pair barriers and atomic operations together. The
next section describes this in more detail.</p>

<h4 id="atomic_barrierpairing">Atomic + barrier pairing</h4>

<p>As usual, it’s useful to illuminate the discussion with an example. We’re
going to consider a basic mutual-exclusion primitive called a <em>spin
lock</em>. The idea is that a memory address (which we’ll call “lock”)
initially holds zero. When a thread wants to execute code in the critical
section, it sets the lock to 1, executes the critical code, and then changes it
back to zero when done. If another thread has already set the lock to 1, we sit
and spin until the lock changes back to zero.</p>

<p>To make this work we use an atomic RMW primitive called
<em>compare-and-swap</em>. The function takes three arguments: the memory
address, the expected current value, and the new value. If the value currently
in memory matches what we expect, it is replaced with the new value, and the old
value is returned. If the current value is not what we expect, we don’t change
anything. A minor variation on this is called <em>compare-and-set</em>; instead
of returning the old value it returns a boolean indicating whether the swap
succeeded. For our needs either will work, but compare-and-set is slightly
simpler for examples, so we use it and just refer to it as “CAS”.</p>

<p>The acquisition of the spin lock is written like this (using a C-like
language):</p>

<pre>do {
    success = atomic_cas(&lock, 0, 1)
} while (!success)

full_memory_barrier()

<em>critical-section</em></pre>

<p>If no thread holds the lock, the lock value will be 0, and the CAS operation
will set it to 1 to indicate that we now have it. If another thread has it, the
lock value will be 1, and the CAS operation will fail because the expected
current value does not match the actual current value. We loop and retry.
(Note this loop is on top of whatever loop the LL/SC code might be doing inside
the atomic_cas function.)</p>

<div style="padding:.5em 2em;">
<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
<p>On SMP, a spin lock is a useful way to guard a small critical section. If we
know that another thread is going to execute a handful of instructions and then
release the lock, we can just burn a few cycles while we wait our turn.
However, if the other thread happens to be executing on the same core, we’re
just wasting time because the other thread can’t make progress until the OS
schedules it again (either by migrating it to a different core or by preempting
us). A proper spin lock implementation would optimistically spin a few times
and then fall back on an OS primitive (such as a Linux futex) that allows the
current thread to sleep while waiting for the other thread to finish up. On a
uniprocessor you never want to spin at all. For the sake of brevity we’re
ignoring all this.</p>
</div>
</div>

<p>The memory barrier is necessary to ensure that other threads observe the
acquisition of the lock before they observe any loads or stores in the critical
section. Without that barrier, the memory accesses could be observed while the
lock is not held.</p>

<p>The <code>full_memory_barrier</code> call here actually does
<strong>two</strong> independent operations. First, it issues the CPU’s full
barrier instruction. Second, it tells the compiler that it is not allowed to
reorder code around the barrier. That way, we know that the
<code>atomic_cas</code> call will be executed before anything in the critical
section. Without this <em>compiler reorder barrier</em>, the compiler has a
great deal of freedom in how it generates code, and the order of instructions in
the compiled code might be much different from the order in the source code.</p>

<p>Of course, we also want to make sure that none of the memory accesses
performed in the critical section are observed after the lock is released. The
full version of the simple spin lock is:</p>

<pre>do {
    success = atomic_cas(&lock, 0, 1)   // acquire
} while (!success)
full_memory_barrier()

<em>critical-section</em>

full_memory_barrier()
atomic_store(&lock, 0)   // release</pre>

<p>We perform our second CPU/compiler memory barrier immediately
<strong>before</strong> we release the lock, so that loads and stores in the
critical section are observed before the release of the lock.</p>

<p>As mentioned earlier, the <code>atomic_store</code> operation is a simple
assignment on ARM and x86. Unlike the atomic RMW operations, we don’t guarantee
that other threads will see this value immediately. This isn’t a problem,
though, because we only need to keep the other threads <strong>out</strong>. The
other threads will stay out until they observe the store of 0. If it takes a
little while for them to observe it, the other threads will spin a little
longer, but we will still execute code correctly.</p>

<p>It’s convenient to combine the atomic operation and the barrier call into a
single function. It also provides other advantages, which will become clear
shortly.</p>
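
<p>Combined, the pieces above might look like this in C. This is a minimal
sketch, assuming a GCC-compatible toolchain, where
<code>__sync_bool_compare_and_swap</code> plays the role of the CAS and
<code>__sync_synchronize()</code> is the full CPU/compiler barrier:</p>

<pre>typedef volatile int spin_lock_t;

void spin_lock(spin_lock_t* lock)
{
    /* acquire: atomic op first, then the full barrier */
    while (!__sync_bool_compare_and_swap(lock, 0, 1)) {
        /* spin until the lock word changes from 0 to 1 under us */
    }
    __sync_synchronize();
}

void spin_unlock(spin_lock_t* lock)
{
    /* release: full barrier first, then a plain store
       (an aligned 32-bit store is itself atomic) */
    __sync_synchronize();
    *lock = 0;
}</pre>

<p>(On GCC the <code>__sync</code> builtins already imply a full barrier, so the
explicit <code>__sync_synchronize()</code> in <code>spin_lock</code> is
redundant there; it’s kept to mirror the pseudocode.)</p>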


<h4 id="acq_rel">Acquire and release</h4>

<p>When acquiring the spinlock, we issue the atomic CAS and then the barrier.
When releasing the spinlock, we issue the barrier and then the atomic store.
This inspires a particular naming convention: operations followed by a barrier
are “acquiring” operations, while operations preceded by a barrier are
“releasing” operations. (It would be wise to install the spin lock example
firmly in mind, as the names are not otherwise intuitive.)</p>

<p>Rewriting the spin lock example with this in mind:</p>

<pre>do {
    success = atomic_<strong>acquire</strong>_cas(&lock, 0, 1)
} while (!success)

<em>critical-section</em>

atomic_<strong>release</strong>_store(&lock, 0)</pre>

<p>This is a little more succinct and easier to read, but the real motivation
for doing this lies in a couple of optimizations we can now perform.</p>

<p>First, consider <code>atomic_release_store</code>. We need to ensure that
the store of zero to the lock word is observed after any loads or stores in the
critical section above it. In other words, we need a load/store and store/store
barrier. In an earlier section we learned that these aren’t necessary on x86
SMP &mdash; only store/load barriers are required. The implementation of
<code>atomic_release_store</code> on x86 is therefore just a compiler reorder
barrier followed by a simple store. No CPU barrier is required.</p>
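
<p>A hedged sketch of what that x86 implementation could look like, using a
GCC-style empty <code>asm</code> statement with a <code>"memory"</code> clobber
as the compiler reorder barrier (the macro name is invented for this
example):</p>

<pre>#define atomic_release_store_x86(addr, value)                               \
    do {                                                                    \
        __asm__ __volatile__("" ::: "memory");  /* compiler barrier only */ \
        *(addr) = (value);                      /* plain aligned store */   \
    } while (0)</pre>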

<p>The second optimization mostly applies to the compiler (although some CPUs,
such as the Itanium, can take advantage of it as well). The basic principle is
that code can move across acquire and release barriers, but only in one
direction.</p>

<p>Suppose we have a mix of locally-visible and globally-visible memory
accesses, with some miscellaneous computation as well:</p>

<pre>local1 = arg1 / 41
local2 = threadStruct->field2
threadStruct->field3 = local2

do {
    success = atomic_acquire_cas(&lock, 0, 1)
} while (!success)

local5 = globalStruct->field5
globalStruct->field6 = local5

atomic_release_store(&lock, 0)</pre>

<p>Here we see two completely independent sets of operations. The first set
operates on a thread-local data structure, so we’re not concerned about clashes
with other threads. The second set operates on a global data structure, which
must be protected with a lock.</p>

<p>A full compiler reorder barrier in the atomic ops will ensure that the
program order matches the source code order at the lock boundaries. However,
allowing the compiler to interleave instructions can improve performance. Loads
from memory can be slow, but the CPU can continue to execute instructions that
don’t require the result of that load while waiting for it to complete. The
code might execute more quickly if it were written like this instead:</p>

<pre>do {
    success = atomic_acquire_cas(&lock, 0, 1)
} while (!success)

local2 = threadStruct->field2
local5 = globalStruct->field5
local1 = arg1 / 41
threadStruct->field3 = local2
globalStruct->field6 = local5

atomic_release_store(&lock, 0)</pre>

<p>We issue both loads, do some unrelated computation, and then execute the
instructions that make use of the loads. If the integer division takes less
time than one of the loads, we essentially get it for free, since it happens
during a period where the CPU would have stalled waiting for a load to
complete.</p>

<p>Note that <strong>all</strong> of the operations are now happening inside the
critical section. Since none of the “threadStruct” operations are visible
outside the current thread, nothing else can see them until we’re finished here,
so it doesn’t matter exactly when they happen.</p>

<p>In general, it is always safe to move operations <strong>into</strong> a
critical section, but never safe to move operations <strong>out of</strong> a
critical section. Put another way, you can migrate code “downward” across an
acquire barrier, and “upward” across a release barrier. If the atomic ops used
a full barrier, this sort of migration would not be possible.</p>

<p>Returning to an earlier point, we can state that on x86 all loads are
acquiring loads, and all stores are releasing stores. As a result:</p>

<ul>
<li>Loads may not be reordered with respect to each other. You can’t take a
load and move it “upward” across another load’s acquire barrier.</li>
<li>Stores may not be reordered with respect to each other, because you can’t
move a store “downward” across another store’s release barrier.</li>
<li>A load followed by a store can’t be reordered, because neither instruction
will tolerate it.</li>
<li>A store followed by a load <strong>can</strong> be reordered, because each
instruction can move across the other in that direction.</li>
</ul>

<p>Hence, you only need store/load barriers on x86 SMP.</p>

<p>Labeling atomic operations with “acquire” or “release” describes not only
whether the barrier is executed before or after the atomic operation, but also
how the compiler is allowed to reorder code.</p>

1045<h2 id="practice">Practice</h2>
1046
1047<p>Debugging memory consistency problems can be very difficult. If a missing
1048memory barrier causes some code to read stale data, you may not be able to
1049figure out why by examining memory dumps with a debugger. By the time you can
1050issue a debugger query, the CPU cores will have all observed the full set of
1051accesses, and the contents of memory and the CPU registers will appear to be in
1052an “impossible” state.</p>
1053
1054<h3 id="c_dont">What not to do in C</h3>
1055
1056<p>Here we present some examples of incorrect code, along with simple ways to
1057fix them. Before we do that, we need to discuss the use of a basic language
1058feature.</p>
1059
David Friedman23bd5922013-09-26 22:30:11 -07001060<h4 id="volatile">C/C++ and "volatile"</h4>
Dirk Doughertyd5894212012-11-28 18:53:10 -08001061
1062<p>When writing single-threaded code, declaring a variable “volatile” can be
1063very useful. The compiler will not omit or reorder accesses to volatile
1064locations. Combine that with the sequential consistency provided by the
1065hardware, and you’re guaranteed that the loads and stores will appear to happen
1066in the expected order.</p>
1067
1068<p>However, accesses to volatile storage may be reordered with non-volatile
1069accesses, so you have to be careful in multi-threaded uniprocessor environments
1070(explicit compiler reorder barriers may be required). There are no atomicity
1071guarantees, and no memory barrier provisions, so “volatile” doesn’t help you at
1072all in multi-threaded SMP environments. The C and C++ language standards are
1073being updated to address this with built-in atomic operations.</p>
1074
1075<p>If you think you need to declare something “volatile”, that is a strong
1076indicator that you should be using one of the atomic operations instead.</p>
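
<p>For a taste of those built-in operations, here is a hedged sketch using the
C11 <code>&lt;stdatomic.h&gt;</code> interface (assuming a toolchain that ships
it); the acquire/release orderings correspond to the acquiring loads and
releasing stores described earlier:</p>

<pre>#include &lt;stdatomic.h&gt;

atomic_int ready;   /* file-scope atomics start out zero */

void publisher(void)          /* thread 1 */
{
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int consumer(void)            /* thread 2 */
{
    return atomic_load_explicit(&ready, memory_order_acquire);
}</pre>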

<h4 id="examplesc">Examples</h4>

<p>In most cases you’d be better off with a synchronization primitive (like a
pthread mutex) rather than an atomic operation, but we will employ the latter to
illustrate how they would be used in a practical situation.</p>

<p>For the sake of brevity we’re ignoring the effects of compiler optimizations
here &mdash; some of this code is broken even on uniprocessors &mdash; so for
all of these examples you must assume that the compiler generates
straightforward code (for example, compiled with gcc -O0). The fixes presented here do
solve both compiler-reordering and memory-access-ordering issues, but we’re only
going to discuss the latter.</p>

<pre>MyThing* gGlobalThing = NULL;

void initGlobalThing()    // runs in thread 1
{
    MyThing* thing = malloc(sizeof(*thing));
    memset(thing, 0, sizeof(*thing));
    thing->x = 5;
    thing->y = 10;
    /* initialization complete, publish */
    gGlobalThing = thing;
}

void useGlobalThing()    // runs in thread 2
{
    if (gGlobalThing != NULL) {
        int i = gGlobalThing->x;    // could be 5, 0, or uninitialized data
        ...
    }
}</pre>

<p>The idea here is that we allocate a structure, initialize its fields, and at
the very end we “publish” it by storing it in a global variable. At that point,
any other thread can see it, but that’s fine since it’s fully initialized,
right? At least, it would be on x86 SMP or a uniprocessor (again, making the
erroneous assumption that the compiler outputs code exactly as we have it in the
source).</p>

<p>Without a memory barrier, the store to <code>gGlobalThing</code> could be observed before
the fields are initialized on ARM. Another thread reading from <code>thing->x</code> could
see 5, 0, or even uninitialized data.</p>

<p>This can be fixed by changing the last assignment to:</p>

<pre>    atomic_release_store(&gGlobalThing, thing);</pre>

<p>That ensures that all other threads will observe the writes in the proper
order, but what about reads? In this case we should be okay on ARM, because the
address dependency rules will ensure that any loads from an offset of
<code>gGlobalThing</code> are observed after the load of
<code>gGlobalThing</code>. However, it’s unwise to rely on architectural
details, since it means your code will be very subtly unportable. The complete
fix also requires a barrier after the load:</p>

<pre>    MyThing* thing = atomic_acquire_load(&gGlobalThing);
    int i = thing->x;</pre>

<p>Now we know the ordering will be correct. This may seem like an awkward way
to write code, and it is, but that’s the price you pay for accessing data
structures from multiple threads without using locks. Besides, address
dependencies won’t always save us:</p>

<pre>MyThing gGlobalThing;

void initGlobalThing()    // runs in thread 1
{
    gGlobalThing.x = 5;
    gGlobalThing.y = 10;
    /* initialization complete */
    gGlobalThing.initialized = true;
}

void useGlobalThing()    // runs in thread 2
{
    if (gGlobalThing.initialized) {
        int i = gGlobalThing.x;    // could be 5 or 0
    }
}</pre>

<p>Because there is no relationship between the <code>initialized</code> field and the
others, the reads and writes can be observed out of order. (Note global data is
initialized to zero by the OS, so it shouldn’t be possible to read “random”
uninitialized data.)</p>

<p>We need to replace the store with:</p>
<pre>    atomic_release_store(&gGlobalThing.initialized, true);</pre>

<p>and replace the load with:</p>
<pre>    int initialized = atomic_acquire_load(&gGlobalThing.initialized);</pre>

<p>Another example of the same problem occurs when implementing
reference-counted data structures. The reference count itself will be
consistent so long as atomic increment and decrement operations are used, but
you can still run into trouble at the edges, for example:</p>

<pre>void RefCounted::release()
{
    int oldCount = atomic_dec(&mRefCount);
    if (oldCount == 1) {    // was decremented to zero
        recycleStorage();
    }
}

void useSharedThing(RefCountedThing* sharedThing)
{
    int localVar = sharedThing->x;
    sharedThing->release();
    sharedThing = NULL;    // can’t use this pointer any more
    doStuff(localVar);     // value of localVar might be wrong
}</pre>

<p>The <code>release()</code> call decrements the reference count using a
barrier-free atomic decrement operation. Because this is an atomic RMW
operation, we know that it will work correctly. If the reference count goes to
zero, we recycle the storage.</p>

<p>The <code>useSharedThing()</code> function extracts what it needs from
<code>sharedThing</code> and then releases its copy. However, because we didn’t
use a memory barrier, and atomic and non-atomic operations can be reordered,
it’s possible for other threads to observe the read of
<code>sharedThing->x</code> <strong>after</strong> they observe the recycle
operation. It’s therefore possible for <code>localVar</code> to hold a value
from "recycled" memory, for example a new object created in the same
location by another thread after <code>release()</code> is called.</p>

<p>This can be fixed by replacing the call to <code>atomic_dec()</code> with
<code>atomic_release_dec()</code>. The barrier ensures that the reads from
<code>sharedThing</code> are observed before we recycle the object.</p>

<div style="padding:.5em 2em;">
<div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
<p>In most cases the above won’t actually fail, because the “recycle” function
is likely guarded by functions that themselves employ barriers (libc heap
<code>free()</code>/<code>delete()</code>, or an object pool guarded by a
mutex). If the recycle function used a lock-free algorithm implemented without
barriers, however, the above code could fail on ARM SMP.</p>
</div>
</div>
1218
1219<h3 id="j_dont">What not to do in Java</h3>
1220
1221<p>We haven’t discussed some relevant Java language features, so we’ll take a
1222quick look at those first.</p>
1223
1224<h4 id="sync_volatile">Java's "synchronized" and "volatile" keywords</h4>
1225
1226<p>The “synchronized” keyword provides the Java language’s in-built locking
1227mechanism. Every object has an associated “monitor” that can be used to provide
1228mutually exclusive access.</p>
1229
1230<p>The implementation of the “synchronized” block has the same basic structure
1231as the spin lock example: it begins with an acquiring CAS, and ends with a
1232releasing store. This means that compilers and code optimizers are free to
1233migrate code into a “synchronized” block. One practical consequence: you must
1234<strong>not</strong> conclude that code inside a synchronized block happens
1235after the stuff above it or before the stuff below it in a function. Going
1236further, if a method has two synchronized blocks that lock the same object, and
1237there are no operations in the intervening code that are observable by another
1238thread, the compiler may perform “lock coarsening” and combine them into a
1239single block.</p>
1240
1241<p>The other relevant keyword is “volatile”. As defined in the specification
1242for Java 1.4 and earlier, a volatile declaration was about as weak as its C
1243counterpart. The spec for Java 1.5 was updated to provide stronger guarantees,
1244almost to the level of monitor synchronization.</p>
1245
1246<p>The effects of volatile accesses can be illustrated with an example. If
1247thread 1 writes to a volatile field, and thread 2 subsequently reads from that
1248same field, then thread 2 is guaranteed to see that write and all writes
1249previously made by thread 1. More generally, the writes made by
1250<strong>any</strong> thread up to the point where it writes the field will be
1251visible to thead 2 when it does the read. In effect, writing to a volatile is
1252like a monitor release, and reading from a volatile is like a monitor
1253acquire.</p>
1254
1255<p>Non-volatile accesses may be reorded with respect to volatile accesses in the
1256usual ways, for example the compiler could move a non-volatile load or store “above” a
1257volatile store, but couldn’t move it “below”. Volatile accesses may not be
1258reordered with respect to each other. The VM takes care of issuing the
1259appropriate memory barriers.</p>
1260
1261<p>It should be mentioned that, while loads and stores of object references and
1262most primitive types are atomic, <code>long</code> and <code>double</code>
1263fields are not accessed atomically unless they are marked as volatile.
1264Multi-threaded updates to non-volatile 64-bit fields are problematic even on
1265uniprocessors.</p>
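
<p>For example, marking a 64-bit field volatile guarantees that its individual
loads and stores won’t be torn into two 32-bit halves (though compound updates
like <code>+=</code> still need a lock or an atomic class):</p>

<pre>class ByteCounter {
    // Without "volatile", a reader on another thread could observe
    // half of an in-progress 64-bit store.
    volatile long mTotalBytes;
}</pre>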
1266
1267<h4 id="examplesj">Examples</h4>
1268
1269<p>Here’s a simple, incorrect implementation of a monotonic counter: <span
1270style="font-size:.9em;color:#777">(<em><a href="#more" style="color:#777">Java
1271theory and practice: Managing volatility</a></em>)</span>.</p>
1272
1273<pre>class Counter {
1274 private int mValue;
1275
1276 public int get() {
1277 return mValue;
1278 }
1279 public void incr() {
1280 mValue++;
1281 }
1282}</pre>
1283
1284<p>Assume <code>get()</code> and <code>incr()</code> are called from multiple
1285threads, and we want to be sure that every thread sees the current count when
1286<code>get()</code> is called. The most glaring problem is that
1287<code>mValue++</code> is actually three operations:</p>
1288
1289<ol>
1290<li><code>reg = mValue</code></li>
1291<li><code>reg = reg + 1</code></li>
1292<li><code>mValue = reg</code></li>
1293</ol>
1294
1295<p>If two threads execute in <code>incr()</code> simultaneously, one of the
1296updates could be lost. To make the increment atomic, we need to declare
1297<code>incr()</code> “synchronized”. With this change, the code will run
1298correctly in multi-threaded uniprocessor environments.</p>
1299
1300<p>It’s still broken on SMP, however. Different threads might see different
1301results from <code>get()</code>, because we’re reading the value with an ordinary load. We
1302can correct the problem by declaring <code>get()</code> to be synchronized.
1303With this change, the code is obviously correct.</p>
1304
1305<p>Unfortunately, we’ve introduced the possibility of lock contention, which
1306could hamper performance. Instead of declaring <code>get()</code> to be
synchronized, we could declare <code>mValue</code> with “volatile”. (Note
<code>incr()</code> must still use <code>synchronized</code>.) Now we know that
1309the volatile write to <code>mValue</code> will be visible to any subsequent volatile read of
1310<code>mValue</code>. <code>incr()</code> will be slightly slower, but
1311<code>get()</code> will be faster, so even in the absence of contention this is
1312a win if reads outnumber writes. (See also {@link
1313java.util.concurrent.atomic.AtomicInteger}.)</p>
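
<p>Putting the two changes together, the counter might look like this (a sketch
of the pattern just described):</p>

<pre>class Counter {
    private volatile int mValue;

    public int get() {
        return mValue;      // volatile read: sees the most recent increment
    }
    public synchronized void incr() {
        mValue++;           // read-modify-write, guarded by the monitor
    }
}</pre>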
1314
1315<p>Here’s another example, similar in form to the earlier C examples:</p>
1316
1317<pre>class MyGoodies {
1318 public int x, y;
1319}
1320class MyClass {
1321 static MyGoodies sGoodies;
1322
1323 void initGoodies() { // runs in thread 1
1324 MyGoodies goods = new MyGoodies();
1325 goods.x = 5;
1326 goods.y = 10;
1327 sGoodies = goods;
1328 }
1329
1330 void useGoodies() { // runs in thread 2
1331 if (sGoodies != null) {
1332 int i = sGoodies.x; // could be 5 or 0
1333 ....
1334 }
1335 }
1336}</pre>
1337
1338<p>This has the same problem as the C code, namely that the assignment
1339<code>sGoodies = goods</code> might be observed before the initialization of the
1340fields in <code>goods</code>. If you declare <code>sGoodies</code> with the
1341volatile keyword, you can think about the loads as if they were
1342<code>atomic_acquire_load()</code> calls, and the stores as if they were
1343<code>atomic_release_store()</code> calls.</p>
1344
1345<p>(Note that only the <code>sGoodies</code> reference itself is volatile. The
1346accesses to the fields inside it are not. The statement <code>z =
1347sGoodies.x</code> will perform a volatile load of <code>MyClass.sGoodies</code>
1348followed by a non-volatile load of <code>sGoodies.x</code>. If you make a local
1349reference <code>MyGoodies localGoods = sGoodies</code>, <code>z =
1350localGoods.x</code> will not perform any volatile loads.)</p>
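
<p>Concretely, the fixed version would read as follows; the comments show the
correspondence with the atomic calls mentioned above:</p>

<pre>class MyClass {
    static volatile MyGoodies sGoodies;

    void initGoodies() {    // runs in thread 1
        MyGoodies goods = new MyGoodies();
        goods.x = 5;
        goods.y = 10;
        sGoodies = goods;   // volatile store, like atomic_release_store()
    }

    void useGoodies() {     // runs in thread 2
        MyGoodies localGoods = sGoodies;    // volatile load, like atomic_acquire_load()
        if (localGoods != null) {
            int i = localGoods.x;           // now guaranteed to be 5
        }
    }
}</pre>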
1351
1352<p>A more common idiom in Java programming is the infamous “double-checked
1353locking”:</p>
1354
1355<pre>class MyClass {
1356 private Helper helper = null;
1357
1358 public Helper getHelper() {
1359 if (helper == null) {
1360 synchronized (this) {
1361 if (helper == null) {
1362 helper = new Helper();
1363 }
1364 }
1365 }
1366 return helper;
1367 }
1368}</pre>
1369
1370<p>The idea is that we want to have a single instance of a <code>Helper</code>
1371object associated with an instance of <code>MyClass</code>. We must only create
1372it once, so we create and return it through a dedicated <code>getHelper()</code>
1373function. To avoid a race in which two threads create the instance, we need to
1374synchronize the object creation. However, we don’t want to pay the overhead for
1375the “synchronized” block on every call, so we only do that part if
1376<code>helper</code> is currently null.</p>
1377
1378<p>This doesn’t work correctly on uniprocessor systems, unless you’re using a
1379traditional Java source compiler and an interpreter-only VM. Once you add fancy
1380code optimizers and JIT compilers it breaks down. See the “‘Double Checked
1381Locking is Broken’ Declaration” link in the appendix for more details, or Item
71 (“Use lazy initialization judiciously”) in Josh Bloch’s <em>Effective Java,
2nd Edition</em>.</p>
1384
1385<p>Running this on an SMP system introduces an additional way to fail. Consider
1386the same code rewritten slightly, as if it were compiled into a C-like language
(I’ve added a couple of integer fields to represent <code>Helper</code>’s
constructor activity):</p>
1389
1390<pre>if (helper == null) {
1391 // acquire monitor using spinlock
1392 while (atomic_acquire_cas(&this.lock, 0, 1) != success)
1393 ;
1394 if (helper == null) {
1395 newHelper = malloc(sizeof(Helper));
1396 newHelper->x = 5;
1397 newHelper->y = 10;
1398 helper = newHelper;
1399 }
1400 atomic_release_store(&this.lock, 0);
1401}</pre>
1402
1403<p>Now the problem should be obvious: the store to <code>helper</code> is
1404happening before the memory barrier, which means another thread could observe
1405the non-null value of <code>helper</code> before the stores to the
1406<code>x</code>/<code>y</code> fields.</p>
1407
1408<p>You could try to ensure that the store to <code>helper</code> happens after
1409the <code>atomic_release_store()</code> on <code>this.lock</code> by rearranging
1410the code, but that won’t help, because it’s okay to migrate code upward &mdash;
1411the compiler could move the assignment back above the
1412<code>atomic_release_store()</code> to its original position.</p>
1413
1414<p>There are two ways to fix this:</p>
1415<ol>
1416<li>Do the simple thing and delete the outer check. This ensures that we never
1417examine the value of <code>helper</code> outside a synchronized block.</li>
<li>Declare <code>helper</code> volatile. With this one small change, the
double-checked locking code above will work correctly on Java 1.5 and later,
as sketched below. (You may want to take a minute to convince yourself that
this is true.)</li>
1421</ol>
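
<p>With the second fix applied, the example becomes:</p>

<pre>class MyClass {
    private volatile Helper helper = null;

    public Helper getHelper() {
        if (helper == null) {               // volatile read
            synchronized (this) {
                if (helper == null) {
                    helper = new Helper();  // volatile write publishes the
                }                           // fully-constructed object
            }
        }
        return helper;
    }
}</pre>

<p>(A common refinement copies <code>helper</code> into a local variable so the
hot path performs only one volatile read; see <em>Effective Java</em> for
details.)</p>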
1422
1423<p>This next example illustrates two important issues when using volatile:</p>
1424
1425<pre>class MyClass {
1426 int data1, data2;
1427 volatile int vol1, vol2;
1428
1429 void setValues() { // runs in thread 1
1430 data1 = 1;
1431 vol1 = 2;
1432 data2 = 3;
1433 }
1434
1435 void useValues1() { // runs in thread 2
1436 if (vol1 == 2) {
1437 int l1 = data1; // okay
1438 int l2 = data2; // wrong
1439 }
1440 }
1441 void useValues2() { // runs in thread 2
1442 int dummy = vol2;
1443 int l1 = data1; // wrong
1444 int l2 = data2; // wrong
    }
}</pre>
1446
1447<p>Looking at <code>useValues1()</code>, if thread 2 hasn’t yet observed the
1448update to <code>vol1</code>, then it can’t know if <code>data1</code> or
1449<code>data2</code> has been set yet. Once it sees the update to
1450<code>vol1</code>, it knows that the change to <code>data1</code> is also
1451visible, because that was made before <code>vol1</code> was changed. However,
1452it can’t make any assumptions about <code>data2</code>, because that store was
1453performed after the volatile store.</p>
1454
<p>The code in <code>useValues2()</code> uses a second volatile field,
1456<code>vol2</code>, in an attempt to force the VM to generate a memory barrier.
1457This doesn’t generally work. To establish a proper “happens-before”
1458relationship, both threads need to be interacting with the same volatile field.
1459You’d have to know that <code>vol2</code> was set after <code>data1/data2</code>
1460in thread 1. (The fact that this doesn’t work is probably obvious from looking
1461at the code; the caution here is against trying to cleverly “cause” a memory
1462barrier instead of creating an ordered series of accesses.)</p>
1463
1464<h3 id="bestpractice">What to do</h3>
1465
1466<h4 id="advice">General advice</h4>
1467
1468<p>In C/C++, use the <code>pthread</code> operations, like mutexes and
1469semaphores. These include the proper memory barriers, providing correct and
1470efficient behavior on all Android platform versions. Be sure to use them
correctly; for example, be wary of signaling a condition variable without
holding the corresponding mutex.</p>
1473
1474<p>It's best to avoid using atomic functions directly. Locking and
1475unlocking a pthread mutex require a single atomic operation each if there’s no
1476contention, so you’re not going to save much by replacing mutex calls with
1477atomic ops. If you need a lock-free design, you must fully understand the
1478concepts in this entire document before you begin (or, better yet, find an
1479existing code library that is known to be correct on SMP ARM).</p>
1480
<p>Be extremely circumspect with “volatile” in C/C++. It often indicates a
1482concurrency problem waiting to happen.</p>
1483
1484<p>In Java, the best answer is usually to use an appropriate utility class from
1485the {@link java.util.concurrent} package. The code is well written and well
1486tested on SMP.</p>
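
<p>For example, the counter from earlier could simply delegate to {@link
java.util.concurrent.atomic.AtomicInteger}, replacing both the lock and the
volatile field:</p>

<pre>import java.util.concurrent.atomic.AtomicInteger;

class Counter {
    private final AtomicInteger mValue = new AtomicInteger();

    public int get() {
        return mValue.get();            // acts like a volatile read
    }
    public void incr() {
        mValue.incrementAndGet();       // lock-free atomic increment
    }
}</pre>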
1487
<p>Perhaps the safest thing you can do is make your class immutable. Objects
of classes like String and Integer hold data that cannot be changed once the
object is created, avoiding all synchronization issues. The book <em>Effective
Java, 2nd Ed.</em> has specific instructions in “Item 15: Minimize Mutability”. Note in
particular the importance of declaring fields “final” <span
style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Bloch</a>)</span>.</p>
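
<p>A minimal sketch of the pattern: all fields private and final, no mutators,
and the class itself final so a subclass can’t add mutable state:</p>

<pre>public final class Point {
    private final int mX, mY;

    public Point(int x, int y) {
        mX = x;
        mY = y;
    }
    public int getX() { return mX; }
    public int getY() { return mY; }
}</pre>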
1494
1495<p>If neither of these options is viable, the Java “synchronized” statement
1496should be used to guard any field that can be accessed by more than one thread.
1497If mutexes won’t work for your situation, you should declare shared fields
1498“volatile”, but you must take great care to understand the interactions between
1499threads. The volatile declaration won’t save you from common concurrent
1500programming mistakes, but it will help you avoid the mysterious failures
1501associated with optimizing compilers and SMP mishaps.</p>
1502
1503<p>The Java Memory Model guarantees that assignments to final fields are visible
1504to all threads once the constructor has finished &mdash; this is what ensures
1505proper synchronization of fields in immutable classes. This guarantee does not
1506hold if a partially-constructed object is allowed to become visible to other
threads. It is necessary to follow safe construction practices <span
style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Safe
Construction Techniques in Java</a>)</span>.</p>
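
<p>For example, a constructor that lets <code>this</code> escape before it
finishes defeats the final-field guarantee. (This is a sketch;
<code>EventSource</code> and its <code>register()</code> method are
hypothetical.)</p>

<pre>class Unsafe {
    private final int mValue;

    Unsafe(EventSource source) {
        source.register(this);  // leak: another thread may see this object...
        mValue = 42;            // ...before this assignment is visible
    }
}</pre>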
1510
1511<h4 id="sync_guarantees">Synchronization primitive guarantees</h4>
1512
1513<p>The pthread library and VM make a couple of useful guarantees: all accesses
1514previously performed by a thread that creates a new thread are observable by
1515that new thread as soon as it starts, and all accesses performed by a thread
1516that is exiting are observable when a <code>join()</code> on that thread
1517returns. This means you don’t need any additional synchronization when
1518preparing data for a new thread or examining the results of a joined thread.</p>
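
<p>For example, the following needs no locks or volatile fields. (This is a
sketch; <code>computeValue()</code> is a stand-in for real work, and the
<code>InterruptedException</code> thrown by <code>join()</code> is not
handled.)</p>

<pre>final int[] result = new int[1];
Thread t = new Thread(new Runnable() {
    public void run() {
        // Anything written before start() was called is visible here.
        result[0] = computeValue();
    }
});
t.start();
t.join();           // when this returns, the write to result[0] is visible
int v = result[0];  // safe without further synchronization</pre>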
1519
1520<p>Whether or not these guarantees apply to interactions with pooled threads
1521depends on the thread pool implementation.</p>
1522
1523<p>In C/C++, the pthread library guarantees that any accesses made by a thread
1524before it unlocks a mutex will be observable by another thread after it locks
1525that same mutex. It also guarantees that any accesses made before calling
1526<code>signal()</code> or <code>broadcast()</code> on a condition variable will
1527be observable by the woken thread.</p>
1528
1529<p>Java language threads and monitors make similar guarantees for the comparable
1530operations.</p>
1531
1532<h4 id="ccpp_changes">Upcoming changes to C/C++</h4>
1533
1534<p>The C and C++ language standards are evolving to include a sophisticated
1535collection of atomic operations. A full matrix of calls for common data types
1536is defined, with selectable memory barrier semantics (choose from relaxed,
1537consume, acquire, release, acq_rel, seq_cst).</p>
1538
1539<p>See the <a href="#more">Further Reading</a> section for pointers to the
1540specifications.</p>
1541
1542
1543<h2 id="closing_notes">Closing Notes</h2>
1544
1545<p>While this document does more than merely scratch the surface, it doesn’t
1546manage more than a shallow gouge. This is a very broad and deep topic. Some
1547areas for further exploration:</p>
1548
1549<ul>
1550<li>Learn the definitions of <em>happens-before</em>,
1551<em>synchronizes-with</em>, and other essential concepts from the Java Memory
1552Model. (It’s hard to understand what “volatile” really means without getting
1553into this.)</li>
1554<li>Explore what compilers are and aren’t allowed to do when reordering code.
1555(The JSR-133 spec has some great examples of legal transformations that lead to
1556unexpected results.)</li>
1557<li>Find out how to write immutable classes in Java and C++. (There’s more to
1558it than just “don’t change anything after construction”.)</li>
1559<li>Internalize the recommendations in the Concurrency section of <em>Effective
1560Java, 2nd Edition.</em> (For example, you should avoid calling methods that are
1561meant to be overridden while inside a synchronized block.)</li>
1562<li>Understand what sorts of barriers you can use on x86 and ARM. (And other
1563CPUs for that matter, for example Itanium’s acquire/release instruction
1564modifiers.)</li>
1565<li>Read through the {@link java.util.concurrent} and {@link
1566java.util.concurrent.atomic} APIs to see what's available. Consider using
1567concurrency annotations like <code>@ThreadSafe</code> and
<code>@GuardedBy</code> (from <code>net.jcip.annotations</code>).</li>
1569</ul>
1570
1571<p>The <a href="#more">Further Reading</a> section in the appendix has links to
1572documents and web sites that will better illuminate these topics.</p>
1573
1574<h2 id="appendix">Appendix</h2>
1575
1576<h3 id="smp_failure_example">SMP failure example</h3>
1577
1578<p>This document describes a lot of “weird” things that can, in theory, happen.
1579If you’re not convinced that these issues are real, a practical example may be
1580useful.</p>
1581
1582<p>Bill Pugh’s Java memory model <a
1583href="http://www.cs.umd.edu/~pugh/java/memoryModel/">web site</a> has a few
1584test programs on it. One interesting test is ReadAfterWrite.java, which does
1585the following:</p>
1586
1587<table>
1588<tr>
1589<th>Thread 1</th>
1590<th>Thread 2</th>
1591</tr>
1592<tr>
1593<td><code>for (int i = 0; i < ITERATIONS; i++) {<br />
1594&nbsp;&nbsp;&nbsp;&nbsp;a = i;<br />
1595&nbsp;&nbsp;&nbsp;&nbsp;BB[i] = b;<br />
1596}</code></td>
1597<td><code>for (int i = 0; i < ITERATIONS; i++) {<br />
1598&nbsp;&nbsp;&nbsp;&nbsp;b = i;<br />
1599&nbsp;&nbsp;&nbsp;&nbsp;AA[i] = a;<br />
1600}</code></td>
1601</tr>
1602</table>
1603
1604<p>Where <code>a</code> and <code>b</code> are declared as volatile
1605<code>int</code> fields, and <code>AA</code> and <code>BB</code> are ordinary
1606integer arrays.
1607
1608<p>This is trying to determine if the VM ensures that, after a value is written
1609to a volatile, the next read from that volatile sees the new value. The test
1610code executes these loops a million or so times, and then runs through afterward
1611and searches the results for inconsistencies.</p>
1612
<p>At the end of execution, <code>AA</code> and <code>BB</code> will be full of
1614gradually-increasing integers. The threads will not run side-by-side in a
1615predictable way, but we can assert a relationship between the array contents.
1616For example, consider this execution fragment:</p>
1617
1618<table>
1619<tr>
1620<th>Thread 1</th>
1621<th>Thread 2</th>
1622</tr>
1623<tr>
1624<td><code>(initially a == 1534)<br />
1625a = 1535<br />
1626BB[1535] = 165<br />
1627a = 1536<br />
1628BB[1536] = 165<br />
1629<br />
1630<br />
1631<br />
1632<br />
1633a = 1537<br />
1634BB[1537] = 167</code></td>
1635<td><code>(initially b == 165)
1636<br />
1637<br />
1638<br />
1639<br />
1640<br />
1641b = 166<br />
1642AA[166] = 1536<br />
1643b = 167<br />
1644AA[167] = 1536<br />
1645<br /></code></td>
1646</tr>
1647</table>
1648
1649<p>(This is written as if the threads were taking turns executing so that it’s
1650more obvious when results from one thread should be visible to the other, but in
1651practice that won’t be the case.)</p>
1652
1653<p>Look at the assignment of <code>AA[166]</code> in thread 2. We are capturing
1654the fact that, at the point where thread 2 was on iteration 166, it can see that
1655thread 1 was on iteration 1536. If we look one step in the future, at thread
16561’s iteration 1537, we expect to see that thread 1 saw that thread 2 was at
1657iteration 166 (or later). <code>BB[1537]</code> holds 167, so it appears things
1658are working.</p>
1659
1660<p>Now suppose we fail to observe a volatile write to <code>b</code>:</p>
1661
1662<table>
1663<tr>
1664<th>Thread 1</th>
1665<th>Thread 2</th>
1666</tr>
1667<tr>
1668<td><code>(initially a == 1534)<br />
1669a = 1535<br />
1670BB[1535] = 165<br />
1671a = 1536<br />
1672BB[1536] = 165<br />
1673<br />
1674<br />
1675<br />
1676<br />
1677a = 1537<br />
1678BB[1537] = 165 // stale b</code></td>
1679<td><code>(initially b == 165)<br />
1680<br />
1681<br />
1682<br />
1683<br />
1684b = 166<br />
1685AA[166] = 1536<br />
1686b = 167<br />
1687AA[167] = 1536</code></td>
1688</tr>
1689</table>
1690
1691<p>Now, <code>BB[1537]</code> holds 165, a smaller value than we expected, so we
know we have a problem. Put succinctly, for i=166, <code>BB[AA[i]+1] &lt; i</code>.
(This also catches failures by thread 2 to observe writes to <code>a</code>;
for example, if we miss an update and assign <code>AA[166] = 1535</code>, we
will get <code>BB[AA[166]+1] == 165</code>.)</p>
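
<p>In code, the scan might look something like this sketch (the actual
ReadAfterWrite.java differs in its details):</p>

<pre>for (int i = 0; i < ITERATIONS; i++) {
    int j = AA[i] + 1;      // thread 1's next iteration
    if (j < ITERATIONS && BB[j] < i) {
        System.out.println("stale read at i=" + i);   // ordering failure
    }
}</pre>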
1696
1697<p>If you run the test program under Dalvik (Android 3.0 “Honeycomb” or later)
1698on an SMP ARM device, it will never fail. If you remove the word “volatile”
1699from the declarations of <code>a</code> and <code>b</code>, it will consistently
1700fail. The program is testing to see if the VM is providing sequentially
1701consistent ordering for accesses to <code>a</code> and <code>b</code>, so you
1702will only see correct behavior when the variables are volatile. (It will also
1703succeed if you run the code on a uniprocessor device, or run it while something
1704else is using enough CPU that the kernel doesn’t schedule the test threads on
1705separate cores.)</p>
1706
1707<p>If you run the modified test a few times you will note that it doesn’t fail
1708in the same place every time. The test fails consistently because it performs
1709the operations a million times, and it only needs to see out-of-order accesses
1710once. In practice, failures will be infrequent and difficult to locate. This
1711test program could very well succeed on a broken VM if things just happen to
1712work out.</p>
1713
1714<h3 id="sync_stores">Implementing synchronization stores</h3>
1715
1716<p><em>(This isn’t something most programmers will find themselves implementing,
1717but the discussion is illuminating.)</em></p>
1718
1719<p>Consider once again volatile accesses in Java. Earlier we made reference to
1720their similarities with acquiring loads and releasing stores, which works as a
1721starting point but doesn’t tell the full story.</p>
1722
1723<p>We start with a fragment of Dekker’s algorithm. Initially both
1724<code>flag1</code> and <code>flag2</code> are false:</p>
1725
1726<table>
1727<tr>
1728<th>Thread 1</th>
1729<th>Thread 2</th>
1730</tr>
1731<tr>
1732<td><code>flag1 = true<br />
1733if (flag2 == false)<br />
1734&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
1735<td><code>flag2 = true<br />
1736if (flag1 == false)<br />
1737&nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
1738</tr>
</table>
1740
1741<p><code>flag1</code> and <code>flag2</code> are declared as volatile boolean
1742fields. The rules for acquiring loads and releasing stores would allow the
1743accesses in each thread to be reordered, breaking the algorithm. Fortunately,
1744the JMM has a few things to say here. Informally:</p>
1745
1746<ul>
1747<li>A write to a volatile field <em>happens-before</em> every subsequent read of that
1748same field. (For this example, it means that if one thread updates a flag, and
1749later on the other thread reads that flag, the reader is guaranteed to see the
1750write.)</li>
1751<li>Every execution has a total order over all volatile field accesses. The
1752order is consistent with program order.</li>
1753</ul>
1754
1755<p>Taken together, these rules say that the volatile accesses in our example
1756must be observable in program order by all threads. Thus, we will never see
1757these threads executing the “critical-stuff” simultaneously.</p>
1758
1759<div style="padding:.5em 2em;">
1760<div style="border-left:4px solid #999;padding:0 1em;font-style:italic;">
1761<p>Another way to think about this is in terms of <em>data races</em>. A data race
1762occurs if two accesses to the same memory location by different threads are not
1763ordered, at least one of them stores to the memory location, and at least one of
1764them is not a synchronization action <span style="font-size:.9em;color:#777">(<a
1765href="#more" style="color:#777">Boehm and McKenney</a>)</span>. The memory model
1766declares that a program free of data races must behave as if executed by a
1767sequentially-consistent machine. Because both <code>flag1</code> and
1768<code>flag2</code> are volatile, and volatile accesses are considered
1769synchronization actions, there are no data races and this code must execute in a
1770sequentially consistent manner.</p>
1771</div>
1772</div>
1773
1774<p>As we saw in an earlier section, we need to insert a store/load barrier
1775between the two operations. The code executed in the VM for a volatile access
1776will look something like this:</p>
1777
1778<table>
1779<tr>
1780<th>volatile load</th>
1781<th>volatile store</th>
1782</tr>
1783<tr>
1784<td><code>reg = A<br />
1785<em>load/load + load/store barrier</em></code></td>
1786<td><code><em>store/store barrier</em><br />
1787A = reg<br />
1788<em>store/load barrier</em></code></td>
1789</tr>
1790</table>
1791
1792<p>The volatile load is just an acquiring load. The volatile store is similar
1793to a releasing store, but we’ve omitted load/store from the pre-store barrier,
1794and added a store/load barrier afterward.</p>
1795
1796<p>What we’re really trying to guarantee, though, is that (using thread 1 as an
1797example) the write to flag1 is observed before the read of flag2. We could
1798issue the store/load barrier before the volatile load instead and get the same
1799result, but because loads tend to outnumber stores it’s best to associate it
1800with the store.</p>
1801
1802<p>On some architectures, it’s possible to implement volatile stores with an
1803atomic operation and skip the explicit store/load barrier. On x86, for example,
1804atomics provide a full barrier. The ARM LL/SC operations don’t include a
1805barrier, so for ARM we must use explicit barriers.</p>
1806
1807<p>(Much of this is due to Doug Lea and his “JSR-133 Cookbook for Compiler
1808Writers” page.)</p>
1809
1810<h3 id="more">Further reading</h3>
1811
1812<p>Web pages and documents that provide greater depth or breadth. The more generally useful articles are nearer the top of the list.</p>
1813
1814<dl>
1815<dt>Shared Memory Consistency Models: A Tutorial</dt>
1816<dd>Written in 1995 by Adve & Gharachorloo, this is a good place to start if you want to dive more deeply into memory consistency models.
1817<br /><a href="http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf">http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf</a></dd>
1818
1819<dt>Memory Barriers</dt>
1820<dd>Nice little article summarizing the issues.
1821<br /><a href="http://en.wikipedia.org/wiki/Memory_barrier">http://en.wikipedia.org/wiki/Memory_barrier</a></dd>
1822
1823<dt>Threads Basics</dt>
1824<dd>An introduction to multi-threaded programming in C++ and Java, by Hans Boehm. Excellent discussion of data races and basic synchronization methods.
1825<br /><a href="http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/threadsintro.html">http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/threadsintro.html</a></dd>
1826
1827<dt>Java Concurrency In Practice</dt>
1828<dd>Published in 2006, this book covers a wide range of topics in great detail. Highly recommended for anyone writing multi-threaded code in Java.
1829<br /><a href="http://www.javaconcurrencyinpractice.com">http://www.javaconcurrencyinpractice.com</a></dd>
1830
1831<dt>JSR-133 (Java Memory Model) FAQ</dt>
1832<dd>A gentle introduction to the Java memory model, including an explanation of synchronization, volatile variables, and construction of final fields.
1833<br /><a href="http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html">http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html</a></dd>
1834
1835<dt>Overview of package java.util.concurrent</dt>
1836<dd>The documentation for the <code>java.util.concurrent</code> package. Near the bottom of the page is a section entitled “Memory Consistency Properties” that explains the guarantees made by the various classes.
1837<br />{@link java.util.concurrent java.util.concurrent} Package Summary</dd>
1838
1839<dt>Java Theory and Practice: Safe Construction Techniques in Java</dt>
1840<dd>This article examines in detail the perils of references escaping during object construction, and provides guidelines for thread-safe constructors.
1841<br /><a href="http://www.ibm.com/developerworks/java/library/j-jtp0618.html">http://www.ibm.com/developerworks/java/library/j-jtp0618.html</a></dd>
1842
1843<dt>Java Theory and Practice: Managing Volatility</dt>
1844<dd>A nice article describing what you can and can’t accomplish with volatile fields in Java.
1845<br /><a href="http://www.ibm.com/developerworks/java/library/j-jtp06197.html">http://www.ibm.com/developerworks/java/library/j-jtp06197.html</a></dd>
1846
1847<dt>The “Double-Checked Locking is Broken” Declaration</dt>
1848<dd>Bill Pugh’s detailed explanation of the various ways in which double-checked locking is broken. Includes C/C++ and Java.
1849<br /><a href="http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html">http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html</a></dd>
1850
1851<dt>[ARM] Barrier Litmus Tests and Cookbook</dt>
1852<dd>A discussion of ARM SMP issues, illuminated with short snippets of ARM code. If you found the examples in this document too un-specific, or want to read the formal description of the DMB instruction, read this. Also describes the instructions used for memory barriers on executable code (possibly useful if you’re generating code on the fly).
1853<br /><a href="http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf">http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf</a></dd>
1854
<dt>Linux Kernel Memory Barriers</dt>
1856<dd>Documentation for Linux kernel memory barriers. Includes some useful examples and ASCII art.
1857<br/><a href="http://www.kernel.org/doc/Documentation/memory-barriers.txt">http://www.kernel.org/doc/Documentation/memory-barriers.txt</a></dd>
1858
1859<dt>ISO/IEC JTC1 SC22 WG21 (C++ standards) 14882 (C++ programming language), chapter 29 (“Atomic operations library”)</dt>
1860<dd>Draft standard for C++ atomic operation features.
1861<br /><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3090.pdf">http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3090.pdf</a>
<br />(intro: <a href="http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf">http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf</a>)</dd>
1863
1864<dt>ISO/IEC JTC1 SC22 WG14 (C standards) 9899 (C programming language) chapter 7.16 (“Atomics &lt;stdatomic.h&gt;”)</dt>
1865<dd>Draft standard for ISO/IEC 9899-201x C atomic operation features. (See also n1484 for errata.)
1866<br /><a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1425.pdf">http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1425.pdf</a></dd>
1867
1868<dt>Dekker’s algorithm</dt>
1869<dd>The “first known correct solution to the mutual exclusion problem in concurrent programming”. The wikipedia article has the full algorithm, with a discussion about how it would need to be updated to work with modern optimizing compilers and SMP hardware.
1870<br /><a href="http://en.wikipedia.org/wiki/Dekker's_algorithm">http://en.wikipedia.org/wiki/Dekker's_algorithm</a></dd>
1871
1872<dt>Comments on ARM vs. Alpha and address dependencies</dt>
1873<dd>An e-mail on the arm-kernel mailing list from Catalin Marinas. Includes a nice summary of address and control dependencies.
1874<br /><a href="http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-05/msg11811.html">http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-05/msg11811.html</a></dd>
1875
1876<dt>What Every Programmer Should Know About Memory</dt>
1877<dd>A very long and detailed article about different types of memory, particularly CPU caches, by Ulrich Drepper.
1878<br /><a href="http://www.akkadia.org/drepper/cpumemory.pdf">http://www.akkadia.org/drepper/cpumemory.pdf</a></dd>
1879
1880<dt>Reasoning about the ARM weakly consistent memory model</dt>
1881<dd>This paper was written by Chong & Ishtiaq of ARM, Ltd. It attempts to describe the ARM SMP memory model in a rigorous but accessible fashion. The definition of “observability” used here comes from this paper.
1882<br /><a href="http://portal.acm.org/ft_gateway.cfm?id=1353528&type=pdf&coll=&dl=&CFID=96099715&CFTOKEN=57505711">http://portal.acm.org/ft_gateway.cfm?id=1353528&type=pdf&coll=&dl=&CFID=96099715&CFTOKEN=57505711</a></dd>
1883
1884<dt>The JSR-133 Cookbook for Compiler Writers</dt>
1885<dd>Doug Lea wrote this as a companion to the JSR-133 (Java Memory Model) documentation. It goes much deeper into the details than most people will need to worry about, but it provides good fodder for contemplation.
1886<br /><a href="http://g.oswego.edu/dl/jmm/cookbook.html">http://g.oswego.edu/dl/jmm/cookbook.html</a></dd>
1887
1888<dt>The Semantics of Power and ARM Multiprocessor Machine Code</dt>
1889<dd>If you prefer your explanations in rigorous mathematical form, this is a fine place to go next.
1890<br /><a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/draft-ppc-arm.pdf">http://www.cl.cam.ac.uk/~pes20/weakmemory/draft-ppc-arm.pdf</a></dd>
1891</dl>