==============================================
LLVM Atomic Instructions and Concurrency Guide
==============================================

.. contents::
   :local:

Introduction
============

Historically, LLVM has not had very strong support for concurrency; some
minimal intrinsics were provided, and ``volatile`` was used in some cases to
achieve rough semantics in the presence of concurrency. However, this is
changing; there are now new instructions which are well-defined in the presence
of threads and asynchronous signals, and the model for existing instructions
has been clarified in the IR.

The atomic instructions are designed specifically to provide readable IR and
optimized code generation for the following:

* The new C++11 ``<atomic>`` header. (`C++11 draft available here
  <http://www.open-std.org/jtc1/sc22/wg21/>`_.) (`C11 draft available here
  <http://www.open-std.org/jtc1/sc22/wg14/>`_.)

* Proper semantics for Java-style memory, for both ``volatile`` and regular
  shared variables. (`Java Specification
  <http://docs.oracle.com/javase/specs/jls/se8/html/jls-17.html>`_)

* gcc-compatible ``__sync_*`` builtins. (`Description
  <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_)

* Other scenarios with atomic semantics, including ``static`` variables with
  non-trivial constructors in C++.

Atomic and volatile in the IR are orthogonal; "volatile" is the C/C++
volatile, which ensures that every volatile load and store happens and is
performed in the stated order. A couple of examples: if a
SequentiallyConsistent store is immediately followed by another
SequentiallyConsistent store to the same address, the first store can be
erased. This transformation is not allowed for a pair of volatile stores. On
the other hand, a non-volatile non-atomic load can be moved across a volatile
load freely, but not an Acquire load.

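As a concrete sketch of the store example above (the global and function names
are invented for illustration):

.. code-block:: llvm

  @x = global i32 0
  @y = global i32 0

  define void @example() {
    ; The first of these two seq_cst stores can be erased; it is
    ; immediately overwritten by the second.
    store atomic i32 1, i32* @x seq_cst, align 4
    store atomic i32 2, i32* @x seq_cst, align 4

    ; Both of these volatile stores must be emitted, in this order.
    store volatile i32 1, i32* @y, align 4
    store volatile i32 2, i32* @y, align 4
    ret void
  }
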
This document is intended as a guide for anyone writing a frontend for LLVM or
working on optimization passes for LLVM, explaining how to deal with
instructions with special semantics in the presence of concurrency. It is not
intended to be a precise specification of the semantics; the details can get
extremely complicated and unreadable, and are not usually necessary.

.. _Optimization outside atomic:

Optimization outside atomic
===========================

The basic ``'load'`` and ``'store'`` allow a variety of optimizations, but can
lead to undefined results in a concurrent environment; see `NotAtomic`_. This
section covers the one optimizer restriction which applies in concurrent
environments; it gets a more extended description because any optimization
dealing with stores needs to be aware of it.

From the optimizer's point of view, the rule is that if there are not any
instructions with atomic ordering involved, concurrency does not matter, with
one exception: if a variable might be visible to another thread or signal
handler, a store cannot be inserted along a path where it might not execute
otherwise. Take the following example:

.. code-block:: c

  /* C code, for readability; run through clang -O2 -S -emit-llvm to get
     equivalent IR */
  int x;
  void f(int* a) {
    for (int i = 0; i < 100; i++) {
      if (a[i])
        x += 1;
    }
  }

The following is equivalent in non-concurrent situations:

.. code-block:: c

  int x;
  void f(int* a) {
    int xtemp = x;
    for (int i = 0; i < 100; i++) {
      if (a[i])
        xtemp += 1;
    }
    x = xtemp;
  }

However, LLVM is not allowed to transform the former to the latter: it could
indirectly introduce undefined behavior if another thread can access ``x`` at
the same time. (This example is particularly of interest because before the
concurrency model was implemented, LLVM would perform this transformation.)

Note that speculative loads are allowed; a load which is part of a race returns
``undef``, but does not have undefined behavior.

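For instance, hoisting a load above a branch is legal, while the analogous
transformation on a store is not (a hypothetical fragment for illustration):

.. code-block:: llvm

  @x = global i32 0

  define i32 @speculate(i1 %cond) {
  entry:
    ; Speculating this load out of the branch is allowed: at worst,
    ; a racing load returns undef.
    %v = load i32* @x, align 4
    br i1 %cond, label %use, label %done
  use:
    ret i32 %v
  done:
    ; Inserting a store to @x on this path would not be allowed.
    ret i32 0
  }
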
Atomic instructions
===================

For cases where simple loads and stores are not sufficient, LLVM provides
various atomic instructions. The exact guarantees provided depend on the
ordering; see `Atomic orderings`_.

``load atomic`` and ``store atomic`` provide the same basic functionality as
non-atomic loads and stores, but provide additional guarantees in situations
where threads and signals are involved.

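For example (a minimal sketch with invented names; the ordering keywords are
described under `Atomic orderings`_):

.. code-block:: llvm

  define i32 @read_flag(i32* %flag) {
    %v = load atomic i32* %flag acquire, align 4
    ret i32 %v
  }

  define void @write_flag(i32* %flag, i32 %v) {
    store atomic i32 %v, i32* %flag release, align 4
    ret void
  }
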
``cmpxchg`` and ``atomicrmw`` are essentially like an atomic load followed by an
atomic store (where the store is conditional for ``cmpxchg``), but no other
memory operation can happen on any thread between the load and store.

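For example (a sketch; the names and orderings are chosen arbitrarily):

.. code-block:: llvm

  define i32 @increment(i32* %counter) {
    ; Atomically add 1 to *%counter and return the old value.
    %old = atomicrmw add i32* %counter, i32 1 seq_cst
    ret i32 %old
  }

  define i1 @try_set(i32* %p, i32 %expected, i32 %new) {
    ; Store %new only if *%p still holds %expected; the i1 half of the
    ; result reports whether the exchange happened.
    %res = cmpxchg i32* %p, i32 %expected, i32 %new seq_cst seq_cst
    %ok = extractvalue { i32, i1 } %res, 1
    ret i1 %ok
  }
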
A ``fence`` provides Acquire and/or Release ordering which is not part of
another operation; it is normally used along with Monotonic memory operations.
A Monotonic load followed by an Acquire fence is roughly equivalent to an
Acquire load, and a Monotonic store following a Release fence is roughly
equivalent to a Release store. SequentiallyConsistent fences behave as both
an Acquire and a Release fence, and offer some additional complicated
guarantees; see the C++11 standard for details.

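As a sketch of those rough equivalences:

.. code-block:: llvm

  define i32 @acquire_via_fence(i32* %p) {
    ; Roughly equivalent to: load atomic i32* %p acquire, align 4
    %v = load atomic i32* %p monotonic, align 4
    fence acquire
    ret i32 %v
  }

  define void @release_via_fence(i32* %p, i32 %v) {
    ; Roughly equivalent to: store atomic i32 %v, i32* %p release, align 4
    fence release
    store atomic i32 %v, i32* %p monotonic, align 4
    ret void
  }
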
Frontends generating atomic instructions generally need to be aware of the
target to some degree; atomic instructions are guaranteed to be lock-free, and
therefore an instruction which is wider than the target natively supports can
be impossible to generate.

.. _Atomic orderings:

Atomic orderings
================

In order to achieve a balance between performance and necessary guarantees,
there are six levels of atomicity. They are listed in order of strength; each
level includes all the guarantees of the previous level except for
Acquire/Release. (See also `LangRef Ordering <LangRef.html#ordering>`_.)

.. _NotAtomic:

NotAtomic
---------

NotAtomic is the obvious: a load or store which is not atomic. (This isn't
really a level of atomicity, but is listed here for comparison.) This is
essentially a regular load or store. If there is a race on a given memory
location, loads from that location return undef.

Relevant standard
  This is intended to match shared variables in C/C++, and to be used in any
  other context where memory access is necessary, and a race is impossible.
  (The precise definition is in `LangRef Memory Model <LangRef.html#memmodel>`_.)

Notes for frontends
  The rule is essentially that all memory accessed with basic loads and stores
  by multiple threads should be protected by a lock or other synchronization;
  otherwise, you are likely to run into undefined behavior. If your frontend is
  for a "safe" language like Java, use Unordered to load and store any shared
  variable. Note that NotAtomic volatile loads and stores are not properly
  atomic; do not try to use them as a substitute. (Per the C/C++ standards,
  volatile does provide some limited guarantees around asynchronous signals, but
  atomics are generally a better solution.)

Notes for optimizers
  Introducing loads to shared variables along a codepath where they would not
  otherwise exist is allowed; introducing stores to shared variables is not. See
  `Optimization outside atomic`_.

Notes for code generation
  The one interesting restriction here is that it is not allowed to write to
  bytes outside of the bytes relevant to a store. This is mostly relevant to
  unaligned stores: it is not allowed in general to convert an unaligned store
  into two aligned stores of the same width as the unaligned store. Backends are
  also expected to generate an i8 store as an i8 store, and not an instruction
  which writes to surrounding bytes. (If you are writing a backend for an
  architecture which cannot satisfy these restrictions and cares about
  concurrency, please send an email to llvmdev.)

Unordered
---------

Unordered is the lowest level of atomicity. It essentially guarantees that races
produce somewhat sane results instead of having undefined behavior. It also
guarantees that the operation is lock-free, so it does not depend on the data
being part of a special atomic structure or on a separate per-process global
lock. Note that code generation will fail for unsupported atomic operations; if
you need such an operation, use explicit locking.

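For example, a frontend for a "safe" language might emit every access to a
shared variable as (a sketch with invented names):

.. code-block:: llvm

  @shared = global i32 0

  define i32 @racy_read() {
    ; A racy read: no ordering guarantees, but never undefined behavior.
    %v = load atomic i32* @shared unordered, align 4
    ret i32 %v
  }

  define void @racy_write(i32 %v) {
    store atomic i32 %v, i32* @shared unordered, align 4
    ret void
  }
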
Relevant standard
  This is intended to match the Java memory model for shared variables.

Notes for frontends
  This cannot be used for synchronization, but is useful for Java and other
  "safe" languages which need to guarantee that the generated code never
  exhibits undefined behavior. Note that this guarantee is cheap on common
  platforms for accesses of a native width, but can be expensive or unavailable
  for wider accesses, like a 64-bit store on ARM. (A frontend for Java or other
  "safe" languages would normally split a 64-bit store on ARM into two 32-bit
  unordered stores.)

Notes for optimizers
  In terms of the optimizer, this prohibits any transformation that transforms a
  single load into multiple loads, transforms a store into multiple stores,
  narrows a store, or stores a value which would not be stored otherwise. Some
  examples of unsafe optimizations are narrowing an assignment into a bitfield,
  rematerializing a load, and turning loads and stores into a memcpy
  call. Reordering unordered operations is safe, though, and optimizers should
  take advantage of that because unordered operations are common in languages
  that need them.

Notes for code generation
  These operations are required to be atomic in the sense that if you use
  unordered loads and unordered stores, a load cannot see a value which was
  never stored. A normal load or store instruction is usually sufficient, but
  note that an unordered load or store cannot be split into multiple
  instructions (or an instruction which does multiple memory operations, like
  ``LDRD`` on ARM without LPAE, or not naturally-aligned ``LDRD`` on LPAE ARM).

Monotonic
---------

Monotonic is the weakest level of atomicity that can be used in synchronization
primitives, although it does not provide any general synchronization. It
essentially guarantees that if you take all the operations affecting a specific
address, a consistent ordering exists.

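The classic use case is a statistics counter which protects no other memory (a
sketch):

.. code-block:: llvm

  define void @count_event(i64* %counter) {
    ; Atomically bump the counter; the old value is not needed.
    %old = atomicrmw add i64* %counter, i64 1 monotonic
    ret void
  }
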
Relevant standard
  This corresponds to the C++11/C11 ``memory_order_relaxed``; see those
  standards for the exact definition.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution. The
  guarantees in terms of synchronization are very weak, so make sure these are
  only used in a pattern which you know is correct. Generally, these would
  either be used for atomic operations which do not protect other memory (like
  an atomic counter), or along with a ``fence``.

Notes for optimizers
  In terms of the optimizer, this can be treated as a read+write on the relevant
  memory location (and alias analysis will take advantage of that). In addition,
  it is legal to reorder non-atomic and Unordered loads around Monotonic
  loads. CSE/DSE and a few other optimizations are allowed, but Monotonic
  operations are unlikely to be used in ways which would make those
  optimizations useful.

Notes for code generation
  Code generation is essentially the same as that for unordered for loads and
  stores. No fences are required. ``cmpxchg`` and ``atomicrmw`` are required
  to appear as a single operation.

Acquire
-------

Acquire provides a barrier of the sort necessary to acquire a lock to access
other memory with normal loads and stores.

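The usual pattern is an Acquire load paired with a Release store, as in this
publication sketch (names invented for illustration):

.. code-block:: llvm

  @data  = global i32 0
  @ready = global i32 0

  define void @publish(i32 %v) {
    store i32 %v, i32* @data, align 4
    ; Release store: publishes the write to @data.
    store atomic i32 1, i32* @ready release, align 4
    ret void
  }

  define i32 @consume() {
  entry:
    ; Acquire load: if it sees 1, the store to @data is visible too.
    %r = load atomic i32* @ready acquire, align 4
    %set = icmp eq i32 %r, 1
    br i1 %set, label %use, label %not_ready
  use:
    %v = load i32* @data, align 4
    ret i32 %v
  not_ready:
    ret i32 -1
  }
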
Relevant standard
  This corresponds to the C++11/C11 ``memory_order_acquire``. It should also be
  used for C++11/C11 ``memory_order_consume``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Acquire only provides a semantic guarantee when paired with a Release
  operation.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. It is
  also possible to move stores from before an Acquire load or read-modify-write
  operation to after it, and move non-Acquire loads from before an Acquire
  operation to after it.

Notes for code generation
  Architectures with weak memory ordering (essentially everything relevant today
  except x86 and SPARC) require some sort of fence to maintain the Acquire
  semantics. The precise fences required vary widely by architecture, but for
  a simple implementation, most architectures provide a barrier which is strong
  enough for everything (``dmb`` on ARM, ``sync`` on PowerPC, etc.). Putting
  such a fence after the equivalent Monotonic operation is sufficient to
  maintain Acquire semantics for a memory operation.

Release
-------

Release is similar to Acquire, but with a barrier of the sort necessary to
release a lock.

Relevant standard
  This corresponds to the C++11/C11 ``memory_order_release``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Release only provides a semantic guarantee when paired with an Acquire
  operation.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. It is
  also possible to move loads from after a Release store or read-modify-write
  operation to before it, and move non-Release stores from after a Release
  operation to before it.

Notes for code generation
  See the section on Acquire; a fence before the relevant operation is usually
  sufficient for Release. Note that a store-store fence is not sufficient to
  implement Release semantics; store-store fences are generally not exposed to
  IR because they are extremely difficult to use correctly.

AcquireRelease
--------------

AcquireRelease (``acq_rel`` in IR) provides both an Acquire and a Release
barrier (for fences and operations which both read and write memory).

Relevant standard
  This corresponds to the C++11/C11 ``memory_order_acq_rel``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Acquire only provides a semantic guarantee when paired with a Release
  operation, and vice versa.

Notes for optimizers
  In general, optimizers should treat this like a nothrow call; the possible
  optimizations are usually not interesting.

Notes for code generation
  This operation has Acquire and Release semantics; see the sections on Acquire
  and Release.

SequentiallyConsistent
----------------------

SequentiallyConsistent (``seq_cst`` in IR) provides Acquire semantics for loads
and Release semantics for stores. Additionally, it guarantees that a total
ordering exists between all SequentiallyConsistent operations.

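The total order is what makes Dekker-style patterns work. In this sketch
(assume thread 1 runs the symmetric code with the two flags swapped), the total
order makes it impossible for both threads to read 0; with any weaker ordering,
both could:

.. code-block:: llvm

  @flag0 = global i32 0
  @flag1 = global i32 0

  define i1 @thread0_try_enter() {
    store atomic i32 1, i32* @flag0 seq_cst, align 4
    %other = load atomic i32* @flag1 seq_cst, align 4
    ; If %other is 0, thread 1 has not yet announced itself.
    %free = icmp eq i32 %other, 0
    ret i1 %free
  }
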
Relevant standard
  This corresponds to the C++11/C11 ``memory_order_seq_cst``, Java volatile, and
  the gcc-compatible ``__sync_*`` builtins which do not specify otherwise.

Notes for frontends
  If a frontend is exposing atomic operations, these are much easier to reason
  about for the programmer than other kinds of operations, and using them is
  generally a practical performance tradeoff.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. For
  SequentiallyConsistent loads and stores, the same reorderings are allowed as
  for Acquire loads and Release stores, except that SequentiallyConsistent
  operations may not be reordered.

Notes for code generation
  SequentiallyConsistent loads minimally require the same barriers as Acquire
  operations and SequentiallyConsistent stores require Release
  barriers. Additionally, the code generator must enforce ordering between
  SequentiallyConsistent stores followed by SequentiallyConsistent loads. This
  is usually done by emitting either a full fence before the loads or a full
  fence after the stores; which is preferred varies by architecture.

Atomics and IR optimization
===========================

Predicates for optimizer writers to query:

* ``isSimple()``: A load or store which is not volatile or atomic. This is
  what, for example, memcpyopt would check for operations it might transform.

* ``isUnordered()``: A load or store which is not volatile and at most
  Unordered. This would be checked, for example, by LICM before hoisting an
  operation. (See the IR sketch after this list.)

* ``mayReadFromMemory()``/``mayWriteToMemory()``: Existing predicates, but note
  that they return true for any operation which is volatile or at least
  Monotonic.

* ``isAtLeastAcquire()``/``isAtLeastRelease()``: These are predicates on
  orderings. They can be useful for passes that are aware of atomics, for
  example to do DSE across a single atomic access, but not across a
  release-acquire pair (see MemoryDependencyAnalysis for an example of this).

* Alias analysis: Note that AA will return ModRef for anything Acquire or
  Release, and for the address accessed by any Monotonic operation.

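A sketch of how the first two predicates classify various loads (the comments
note which predicates hold for each):

.. code-block:: llvm

  define void @classify(i32* %p) {
    %a = load i32* %p                           ; isSimple, isUnordered
    %b = load atomic i32* %p unordered, align 4 ; isUnordered only
    %c = load atomic i32* %p acquire, align 4   ; neither; its ordering
                                                ; is at least Acquire
    %d = load volatile i32* %p                  ; neither (volatile)
    ret void
  }
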
To support optimizing around atomic operations, make sure you are using the
right predicates; everything should work if that is done. If your pass should
optimize some atomic operations (Unordered operations in particular), make sure
it doesn't replace an atomic load or store with a non-atomic operation.

Some examples of how optimizations interact with various kinds of atomic
operations:

* ``memcpyopt``: An atomic operation cannot be optimized into part of a
  memcpy/memset, including unordered loads/stores. It can pull operations
  across some atomic operations.

* LICM: Unordered loads/stores can be moved out of a loop. It just treats
  monotonic operations like a read+write to a memory location, and anything
  stricter than that like a nothrow call.

* DSE: Unordered stores can be DSE'ed like normal stores (see the sketch after
  this list). Monotonic stores can be DSE'ed in some cases, but it's tricky to
  reason about, and not especially important. It is possible in some cases for
  DSE to operate across a stronger atomic operation, but it is fairly
  tricky. DSE delegates this reasoning to MemoryDependencyAnalysis (which is
  also used by other passes like GVN).

* Folding a load: Any atomic load from a constant global can be constant-folded,
  because it cannot be observed. Similar reasoning allows scalarrepl with
  atomic loads and stores.

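For instance, in this sketch the first store is dead, and DSE may delete it
just as it would for a pair of ordinary stores:

.. code-block:: llvm

  @x = global i32 0

  define void @overwrite(i32 %v) {
    ; Dead: overwritten below with no intervening read or synchronization.
    store atomic i32 0, i32* @x unordered, align 4
    store atomic i32 %v, i32* @x unordered, align 4
    ret void
  }
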
Atomics and Codegen
===================

Atomic operations are represented in the SelectionDAG with ``ATOMIC_*`` opcodes.
On architectures which use barrier instructions for all atomic ordering (like
ARM), appropriate fences can be emitted by the AtomicExpand Codegen pass if
``setInsertFencesForAtomic()`` was used.

The MachineMemOperand for all atomic operations is currently marked as volatile;
this is not correct in the IR sense of volatile, but CodeGen handles anything
marked volatile very conservatively. This should get fixed at some point.

Common architectures have some way of representing at least a pointer-sized
lock-free ``cmpxchg``; such an operation can be used to implement all the other
atomic operations which can be represented in IR up to that size. Backends are
expected to implement all those operations, but not operations which cannot be
implemented in a lock-free manner. It is expected that backends will give an
error when given an operation which cannot be implemented. (The LLVM code
generator is not very helpful here at the moment, but hopefully that will
change.)

On x86, all atomic loads generate a ``MOV``. SequentiallyConsistent stores
generate an ``XCHG``, other stores generate a ``MOV``. SequentiallyConsistent
fences generate an ``MFENCE``, other fences do not cause any code to be
generated. ``cmpxchg`` uses the ``LOCK CMPXCHG`` instruction. ``atomicrmw xchg``
uses ``XCHG``, ``atomicrmw add`` and ``atomicrmw sub`` use ``XADD``, and all
other ``atomicrmw`` operations generate a loop with ``LOCK CMPXCHG``. Depending
on the users of the result, some ``atomicrmw`` operations can be translated into
operations like ``LOCK AND``, but that does not work in general.

On ARM (before v8), MIPS, and many other RISC architectures, Acquire, Release,
and SequentiallyConsistent semantics require barrier instructions for every such
operation. Loads and stores generate normal instructions. ``cmpxchg`` and
``atomicrmw`` can be represented using a loop with LL/SC-style instructions
which take some sort of exclusive lock on a cache line (``LDREX`` and ``STREX``
on ARM, etc.).

It is often easiest for backends to use AtomicExpandPass to lower some of the
atomic constructs. Here are some lowerings it can do:

* cmpxchg -> loop with load-linked/store-conditional
  by overriding ``hasLoadLinkedStoreConditional()``, ``emitLoadLinked()``,
  ``emitStoreConditional()``
* large loads/stores -> ll-sc/cmpxchg
  by overriding ``shouldExpandAtomicStoreInIR()``/``shouldExpandAtomicLoadInIR()``
* strong atomic accesses -> monotonic accesses + fences
  by using ``setInsertFencesForAtomic()`` and overriding ``emitLeadingFence()``
  and ``emitTrailingFence()``
* atomic rmw -> loop with cmpxchg or load-linked/store-conditional
  by overriding ``expandAtomicRMWInIR()``

For an example of all of these, look at the ARM backend.