.. _atomics:

==============================================
LLVM Atomic Instructions and Concurrency Guide
==============================================

.. contents::
   :local:

Introduction
============

Historically, LLVM has not had very strong support for concurrency; some
minimal intrinsics were provided, and ``volatile`` was used in some cases to
achieve rough semantics in the presence of concurrency. However, this is
changing; there are now new instructions which are well-defined in the presence
of threads and asynchronous signals, and the model for existing instructions
has been clarified in the IR.

The atomic instructions are designed specifically to provide readable IR and
optimized code generation for the following:

* The new C++0x ``<atomic>`` header. (`C++0x draft available here
  <http://www.open-std.org/jtc1/sc22/wg21/>`_.) (`C1x draft available here
  <http://www.open-std.org/jtc1/sc22/wg14/>`_.)

* Proper semantics for Java-style memory, for both ``volatile`` and regular
  shared variables. (`Java Specification
  <http://java.sun.com/docs/books/jls/third_edition/html/memory.html>`_)

* gcc-compatible ``__sync_*`` builtins. (`Description
  <http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html>`_)

* Other scenarios with atomic semantics, including ``static`` variables with
  non-trivial constructors in C++.
Atomic and volatile in the IR are orthogonal; "volatile" is the C/C++ volatile,
which ensures that every volatile load and store happens and is performed in
the stated order. A couple of examples: if a SequentiallyConsistent store is
immediately followed by another SequentiallyConsistent store to the same
address, the first store can be erased. This transformation is not allowed for
a pair of volatile stores. On the other hand, a non-volatile non-atomic load
can be moved across a volatile load freely, but not an Acquire load.
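
As a rough source-level illustration (a C++0x sketch, not the IR itself, and
assuming the usual clang lowering), the distinction looks like this:

.. code-block:: c++

  #include <atomic>

  std::atomic<int> a;  // becomes atomic loads/stores in the IR
  volatile int v;      // becomes volatile (but non-atomic) operations

  void g() {
    // Two adjacent SequentiallyConsistent stores: the first is dead and may
    // be erased by the optimizer.
    a.store(1);
    a.store(2);

    // Two adjacent volatile stores: both must be emitted, in this order.
    v = 1;
    v = 2;
  }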

This document is intended to provide a guide for anyone either writing a
frontend for LLVM or working on optimization passes for LLVM, explaining how to
deal with instructions with special semantics in the presence of concurrency.
This is not intended to be a precise guide to the semantics; the details can
get extremely complicated and unreadable, and are not usually necessary.

.. _Optimization outside atomic:

Optimization outside atomic
===========================

The basic ``'load'`` and ``'store'`` allow a variety of optimizations, but can
lead to undefined results in a concurrent environment; see `NotAtomic`_. This
section specifically goes into the one optimizer restriction which applies in
concurrent environments; it gets an extended description here because any
optimization dealing with stores needs to be aware of it.

From the optimizer's point of view, the rule is that if there are not any
instructions with atomic ordering involved, concurrency does not matter, with
one exception: if a variable might be visible to another thread or signal
handler, a store cannot be inserted along a path where it might not execute
otherwise. Take the following example:

.. code-block:: c

  /* C code, for readability; run through clang -O2 -S -emit-llvm to get
     equivalent IR */
  int x;
  void f(int* a) {
    for (int i = 0; i < 100; i++) {
      if (a[i])
        x += 1;
    }
  }

The following is equivalent in non-concurrent situations:

.. code-block:: c

  int x;
  void f(int* a) {
    int xtemp = x;
    for (int i = 0; i < 100; i++) {
      if (a[i])
        xtemp += 1;
    }
    x = xtemp;
  }

However, LLVM is not allowed to transform the former to the latter: it could
indirectly introduce undefined behavior if another thread can access ``x`` at
the same time. (This example is particularly of interest because before the
concurrency model was implemented, LLVM would perform this transformation.)

Note that speculative loads are allowed; a load which is part of a race returns
``undef``, but does not have undefined behavior.
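
For instance (a sketch at the source level, with illustrative names), the load
below may be hoisted above the branch, even though the corresponding store in
the earlier example may not be:

.. code-block:: c++

  int x;

  int g(int *a, int i) {
    int r = 0;
    if (a[i])
      r = x;  // This load may be hoisted above the branch (a speculative
              // load). If another thread writes x concurrently, the loaded
              // value is undef, but executing the load is not undefined
              // behavior.
    return r;
  }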

Atomic instructions
===================

For cases where simple loads and stores are not sufficient, LLVM provides
various atomic instructions. The exact guarantees provided depend on the
ordering; see `Atomic orderings`_.

``load atomic`` and ``store atomic`` provide the same basic functionality as
non-atomic loads and stores, but provide additional guarantees in situations
where threads and signals are involved.

``cmpxchg`` and ``atomicrmw`` are essentially like an atomic load followed by an
atomic store (where the store is conditional for ``cmpxchg``), but no other
memory operation can happen on any thread between the load and store. Note that
LLVM's cmpxchg does not provide quite as many options as the C++0x version.
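
At the source level, these correspond roughly to the C++0x compare-exchange and
fetch-and-op operations; a sketch of the mapping (the function names here are
illustrative, and the exact lowering is up to the frontend):

.. code-block:: c++

  #include <atomic>

  std::atomic<int> val;

  bool try_claim(int expected, int desired) {
    // Roughly a 'cmpxchg' at the IR level.
    return val.compare_exchange_strong(expected, desired);
  }

  int bump() {
    // Roughly an 'atomicrmw add' at the IR level; returns the old value.
    return val.fetch_add(1);
  }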

A ``fence`` provides Acquire and/or Release ordering which is not part of
another operation; it is normally used along with Monotonic memory operations.
A Monotonic load followed by an Acquire fence is roughly equivalent to an
Acquire load.
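
In C++0x terms, that combination looks like the following sketch (assuming the
usual mapping of ``memory_order_relaxed`` to Monotonic):

.. code-block:: c++

  #include <atomic>

  std::atomic<int> flag;

  int load_with_fence() {
    // A Monotonic (relaxed) load followed by an Acquire fence...
    int v = flag.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    // ...is roughly equivalent to a single Acquire load:
    //   int v = flag.load(std::memory_order_acquire);
    return v;
  }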

Frontends generating atomic instructions generally need to be aware of the
target to some degree; atomic instructions are guaranteed to be lock-free, and
therefore an instruction which is wider than the target natively supports can
be impossible to generate.

.. _Atomic orderings:

Atomic orderings
================

In order to achieve a balance between performance and necessary guarantees,
there are six levels of atomicity. They are listed in order of strength; each
level includes all the guarantees of the previous level except for
Acquire/Release. (See also `LangRef Ordering <LangRef.html#ordering>`_.)

.. _NotAtomic:

NotAtomic
---------

NotAtomic is the obvious: a load or store which is not atomic. (This isn't
really a level of atomicity, but is listed here for comparison.) This is
essentially a regular load or store. If there is a race on a given memory
location, loads from that location return undef.

Relevant standard
  This is intended to match shared variables in C/C++, and to be used in any
  other context where memory access is necessary, and a race is impossible.
  (The precise definition is in `LangRef Memory Model <LangRef.html#memmodel>`_.)

Notes for frontends
  The rule is essentially that all memory accessed with basic loads and stores
  by multiple threads should be protected by a lock or other synchronization;
  otherwise, you are likely to run into undefined behavior. If your frontend is
  for a "safe" language like Java, use Unordered to load and store any shared
  variable. Note that NotAtomic volatile loads and stores are not properly
  atomic; do not try to use them as a substitute. (Per the C/C++ standards,
  volatile does provide some limited guarantees around asynchronous signals,
  but atomics are generally a better solution.)

Notes for optimizers
  Introducing loads to shared variables along a codepath where they would not
  otherwise exist is allowed; introducing stores to shared variables is not.
  See `Optimization outside atomic`_.

Notes for code generation
  The one interesting restriction here is that it is not allowed to write to
  bytes outside of the bytes relevant to a store. This is mostly relevant to
  unaligned stores: it is not allowed in general to convert an unaligned store
  into two aligned stores of the same width as the unaligned store. Backends
  are also expected to generate an i8 store as an i8 store, and not an
  instruction which writes to surrounding bytes. (If you are writing a backend
  for an architecture which cannot satisfy these restrictions and cares about
  concurrency, please send an email to llvmdev.)
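
For example (a sketch of the kind of code this restriction protects), two
threads may legitimately write to adjacent bytes, so an ``i8`` store must not
be implemented as a wider read-modify-write of the surrounding word:

.. code-block:: c++

  struct Flags {
    char a;
    char b;
  };

  Flags flags;

  // Each store must only touch its own byte; widening either store into a
  // word-sized load/modify/store could silently lose the other thread's write.
  void thread1() { flags.a = 1; }
  void thread2() { flags.b = 1; }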

Unordered
---------

Unordered is the lowest level of atomicity. It essentially guarantees that races
produce somewhat sane results instead of having undefined behavior. It also
guarantees the operation to be lock-free, so it does not depend on the data
being part of a special atomic structure or on a separate per-process global
lock. Note that code generation will fail for unsupported atomic operations; if
you need such an operation, use explicit locking.

Relevant standard
  This is intended to match the Java memory model for shared variables.

Notes for frontends
  This cannot be used for synchronization, but is useful for Java and other
  "safe" languages which need to guarantee that the generated code never
  exhibits undefined behavior. Note that this guarantee is cheap on common
  platforms for loads of a native width, but can be expensive or unavailable
  for wider operations, like a 64-bit store on ARM. (A frontend for Java or
  other "safe" languages would normally split a 64-bit store on ARM into two
  32-bit unordered stores.)

Notes for optimizers
  In terms of the optimizer, this prohibits any transformation that transforms
  a single load into multiple loads, transforms a store into multiple stores,
  narrows a store, or stores a value which would not be stored otherwise. Some
  examples of unsafe optimizations are narrowing an assignment into a bitfield,
  rematerializing a load, and turning loads and stores into a memcpy
  call. Reordering unordered operations is safe, though, and optimizers should
  take advantage of that because unordered operations are common in languages
  that need them.

Notes for code generation
  These operations are required to be atomic in the sense that if you use
  unordered loads and unordered stores, a load cannot see a value which was
  never stored. A normal load or store instruction is usually sufficient, but
  note that an unordered load or store cannot be split into multiple
  instructions (or an instruction which does multiple memory operations, like
  ``LDRD`` on ARM).

Monotonic
---------

Monotonic is the weakest level of atomicity that can be used in synchronization
primitives, although it does not provide any general synchronization. It
essentially guarantees that if you take all the operations affecting a specific
address, a consistent ordering exists.
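
A typical use, mentioned in the notes below, is a simple counter that does not
guard any other memory; a C++0x sketch, using the usual mapping of
``memory_order_relaxed`` to Monotonic:

.. code-block:: c++

  #include <atomic>

  std::atomic<long> events;

  void record_event() {
    // Monotonic/relaxed is enough: each increment is atomic, and all
    // operations on this one location are consistently ordered.
    events.fetch_add(1, std::memory_order_relaxed);
  }

  long read_count() {
    // The value read says nothing about other memory; it is just a count.
    return events.load(std::memory_order_relaxed);
  }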

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_relaxed``; see those
  standards for the exact definition.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution. The
  guarantees in terms of synchronization are very weak, so make sure these are
  only used in a pattern which you know is correct. Generally, these would
  either be used for atomic operations which do not protect other memory (like
  an atomic counter), or along with a ``fence``.

Notes for optimizers
  In terms of the optimizer, this can be treated as a read+write on the
  relevant memory location (and alias analysis will take advantage of that). In
  addition, it is legal to reorder non-atomic and Unordered loads around
  Monotonic loads. CSE/DSE and a few other optimizations are allowed, but
  Monotonic operations are unlikely to be used in ways which would make those
  optimizations useful.

Notes for code generation
  Code generation is essentially the same as that for unordered for loads and
  stores. No fences are required. ``cmpxchg`` and ``atomicrmw`` are required
  to appear as a single operation.

Acquire
-------

Acquire provides a barrier of the sort necessary to acquire a lock to access
other memory with normal loads and stores.
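
The canonical pattern is publishing data with a Release store and consuming it
with an Acquire load; a C++0x sketch (the flag and payload names are
illustrative):

.. code-block:: c++

  #include <atomic>

  int payload;                    // ordinary (NotAtomic) data
  std::atomic<bool> ready(false);

  void producer() {
    payload = 42;                                  // plain store
    ready.store(true, std::memory_order_release);  // Release: publishes payload
  }

  void consumer() {
    if (ready.load(std::memory_order_acquire)) {   // Acquire: pairs with the
      int v = payload;                             // Release store, so this
      (void)v;                                     // plain load sees 42
    }
  }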

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_acquire``. It should also be
  used for C++0x/C1x ``memory_order_consume``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Acquire only provides a semantic guarantee when paired with a Release
  operation.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. It is
  also possible to move stores from before an Acquire load or read-modify-write
  operation to after it, and move non-Acquire loads from before an Acquire
  operation to after it.

Notes for code generation
  Architectures with weak memory ordering (essentially everything relevant
  today except x86 and SPARC) require some sort of fence to maintain the
  Acquire semantics. The precise fences required vary widely by architecture,
  but for a simple implementation, most architectures provide a barrier which
  is strong enough for everything (``dmb`` on ARM, ``sync`` on PowerPC, etc.).
  Putting such a fence after the equivalent Monotonic operation is sufficient
  to maintain Acquire semantics for a memory operation.

Release
-------

Release is similar to Acquire, but with a barrier of the sort necessary to
release a lock.

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_release``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Release only provides a semantic guarantee when paired with an Acquire
  operation.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. It is
  also possible to move loads from after a Release store or read-modify-write
  operation to before it, and move non-Release stores from after a Release
  operation to before it.

Notes for code generation
  See the section on Acquire; a fence before the relevant operation is usually
  sufficient for Release. Note that a store-store fence is not sufficient to
  implement Release semantics; store-store fences are generally not exposed to
  IR because they are extremely difficult to use correctly.

AcquireRelease
--------------

AcquireRelease (``acq_rel`` in IR) provides both an Acquire and a Release
barrier (for fences and operations which both read and write memory).

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_acq_rel``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Acquire only provides a semantic guarantee when paired with a Release
  operation, and vice versa.

Notes for optimizers
  In general, optimizers should treat this like a nothrow call; the possible
  optimizations are usually not interesting.

Notes for code generation
  This operation has Acquire and Release semantics; see the sections on Acquire
  and Release.

SequentiallyConsistent
----------------------

SequentiallyConsistent (``seq_cst`` in IR) provides Acquire semantics for loads
and Release semantics for stores. Additionally, it guarantees that a total
ordering exists between all SequentiallyConsistent operations.
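
The total ordering matters for patterns like the following (a sketch of the
classic store/load case, which Acquire/Release alone does not make work):

.. code-block:: c++

  #include <atomic>

  std::atomic<int> x(0), y(0);
  int r1, r2;

  void thread1() {
    x.store(1);     // SequentiallyConsistent store
    r1 = y.load();  // SequentiallyConsistent load
  }

  void thread2() {
    y.store(1);     // SequentiallyConsistent store
    r2 = x.load();  // SequentiallyConsistent load
  }

  // With SequentiallyConsistent operations, r1 == 0 && r2 == 0 is impossible
  // after both threads finish; with only Acquire/Release it would be allowed.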

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_seq_cst``, Java volatile,
  and the gcc-compatible ``__sync_*`` builtins which do not specify otherwise.

Notes for frontends
  If a frontend is exposing atomic operations, these are much easier to reason
  about for the programmer than other kinds of operations, and using them is
  generally a practical performance tradeoff.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. For
  SequentiallyConsistent loads and stores, the same reorderings are allowed as
  for Acquire loads and Release stores, except that SequentiallyConsistent
  operations may not be reordered.

Notes for code generation
  SequentiallyConsistent loads minimally require the same barriers as Acquire
  operations and SequentiallyConsistent stores require Release
  barriers. Additionally, the code generator must enforce ordering between
  SequentiallyConsistent stores followed by SequentiallyConsistent loads. This
  is usually done by emitting either a full fence before the loads or a full
  fence after the stores; which is preferred varies by architecture.

Atomics and IR optimization
===========================

Predicates for optimizer writers to query:

* ``isSimple()``: A load or store which is not volatile or atomic. This is
  what, for example, memcpyopt would check for operations it might transform.

* ``isUnordered()``: A load or store which is not volatile and at most
  Unordered. This would be checked, for example, by LICM before hoisting an
  operation.

* ``mayReadFromMemory()``/``mayWriteToMemory()``: Existing predicates, but note
  that they return true for any operation which is volatile or at least
  Monotonic.

* Alias analysis: Note that AA will return ModRef for anything Acquire or
  Release, and for the address accessed by any Monotonic operation.

To support optimizing around atomic operations, make sure you are using the
right predicates; everything should work if that is done. If your pass should
optimize some atomic operations (Unordered operations in particular), make sure
it doesn't replace an atomic load or store with a non-atomic operation.
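
A pass might use these predicates roughly as follows (a sketch, not taken from
any particular in-tree pass; the header path varies across LLVM versions):

.. code-block:: c++

  #include "llvm/IR/Instructions.h"  // "llvm/Instructions.h" in older trees

  using namespace llvm;

  // Decide whether a load is safe for a transformation that ignores ordering.
  static bool canTransformLoad(LoadInst *LI, bool HandlesUnordered) {
    if (LI->isSimple())
      return true;   // not volatile, not atomic
    if (HandlesUnordered && LI->isUnordered())
      return true;   // not volatile, at most Unordered
    return false;    // volatile, or Monotonic and stronger: leave it alone
  }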

Some examples of how optimizations interact with various kinds of atomic
operations:

* ``memcpyopt``: An atomic operation cannot be optimized into part of a
  memcpy/memset, including unordered loads/stores. It can pull operations
  across some atomic operations.

* LICM: Unordered loads/stores can be moved out of a loop. It just treats
  Monotonic operations like a read+write to a memory location, and anything
  stricter than that like a nothrow call.

* DSE: Unordered stores can be DSE'ed like normal stores. Monotonic stores can
  be DSE'ed in some cases, but it's tricky to reason about, and not especially
  important.

* Folding a load: Any atomic load from a constant global can be
  constant-folded, because it cannot be observed. Similar reasoning allows
  scalarrepl with atomic loads and stores.

Atomics and Codegen
===================

Atomic operations are represented in the SelectionDAG with ``ATOMIC_*`` opcodes.
On architectures which use barrier instructions for all atomic ordering (like
ARM), appropriate fences are split out as the DAG is built.

The MachineMemOperand for all atomic operations is currently marked as volatile;
this is not correct in the IR sense of volatile, but CodeGen handles anything
marked volatile very conservatively. This should get fixed at some point.

Common architectures have some way of representing at least a pointer-sized
lock-free ``cmpxchg``; such an operation can be used to implement all the other
atomic operations which can be represented in IR up to that size. Backends are
expected to implement all those operations, but not operations which cannot be
implemented in a lock-free manner. It is expected that backends will give an
error when given an operation which cannot be implemented. (The LLVM code
generator is not very helpful here at the moment, but hopefully that will
change.)

The implementation of atomics on LL/SC architectures (like ARM) is currently a
bit of a mess; there is a lot of copy-pasted code across targets, and the
representation is relatively unsuited to optimization (it would be nice to be
able to optimize loops involving cmpxchg etc.).

On x86, all atomic loads generate a ``MOV``. SequentiallyConsistent stores
generate an ``XCHG``, other stores generate a ``MOV``. SequentiallyConsistent
fences generate an ``MFENCE``, other fences do not cause any code to be
generated. cmpxchg uses the ``LOCK CMPXCHG`` instruction. ``atomicrmw xchg``
uses ``XCHG``, ``atomicrmw add`` and ``atomicrmw sub`` use ``XADD``, and all
other ``atomicrmw`` operations generate a loop with ``LOCK CMPXCHG``. Depending
on the users of the result, some ``atomicrmw`` operations can be translated into
operations like ``LOCK AND``, but that does not work in general.
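
The compare-and-swap loop mentioned above has roughly the following shape,
shown here as a C++0x sketch rather than actual emitted code
(``compare_exchange_weak`` plays the role of ``LOCK CMPXCHG``):

.. code-block:: c++

  #include <atomic>

  // Roughly what an 'atomicrmw nand' (or any operation without a dedicated
  // LOCK instruction) expands to: load, compute, retry until the compare and
  // exchange succeeds, and return the old value.
  int atomic_nand(std::atomic<int> &v, int operand) {
    int old_val = v.load();
    while (!v.compare_exchange_weak(old_val, ~(old_val & operand))) {
      // On failure, old_val has been reloaded with the current value; retry.
    }
    return old_val;
  }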

On ARM, MIPS, and many other RISC architectures, Acquire, Release, and
SequentiallyConsistent semantics require barrier instructions for every such
operation. Loads and stores generate normal instructions. ``cmpxchg`` and
``atomicrmw`` can be represented using a loop with LL/SC-style instructions
which take some sort of exclusive lock on a cache line (``LDREX`` and ``STREX``
on ARM, etc.). At the moment, the IR does not provide any way to represent a
weak ``cmpxchg`` which would not require a loop.