Blame - llvm/docs/Atomics.rst - toolchain/llvm-project

==============================================
LLVM Atomic Instructions and Concurrency Guide
==============================================

.. contents::
   :local:
				7
				8	Introduction
				9	============
				10
JF Bastien	86d8d87	2016-04-05 00:31:25 +0000	[diff] [blame]	11	LLVM supports instructions which are well-defined in the presence of threads and
				12	asynchronous signals.
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	13
				14	The atomic instructions are designed specifically to provide readable IR and
				15	optimized code generation for the following:
				16
JF Bastien	86d8d87	2016-04-05 00:31:25 +0000	[diff] [blame]	17	* The C++11 ``<atomic>`` header. (`C++11 draft available here
Robin Morisset	e83f59e	2014-10-03 01:04:20 +0000	[diff] [blame]	18	<http://www.open-std.org/jtc1/sc22/wg21/>`_.) (`C11 draft available here
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	19	<http://www.open-std.org/jtc1/sc22/wg14/>`_.)
				20
				21	* Proper semantics for Java-style memory, for both ``volatile`` and regular
				22	shared variables. (`Java Specification
Benjamin Kramer	2abde4f	2014-08-04 09:26:40 +0000	[diff] [blame]	23	<http://docs.oracle.com/javase/specs/jls/se8/html/jls-17.html>`_)
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	24
				25	* gcc-compatible ``__sync_*`` builtins. (`Description
Benjamin Kramer	2abde4f	2014-08-04 09:26:40 +0000	[diff] [blame]	26	<https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_)
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	27
				28	* Other scenarios with atomic semantics, including ``static`` variables with
				29	non-trivial constructors in C++.
				30
				31	Atomic and volatile in the IR are orthogonal; "volatile" is the C/C++ volatile,
				32	which ensures that every volatile load and store happens and is performed in the
				33	stated order. A couple examples: if a SequentiallyConsistent store is
				34	immediately followed by another SequentiallyConsistent store to the same
				35	address, the first store can be erased. This transformation is not allowed for a
				36	pair of volatile stores. On the other hand, a non-volatile non-atomic load can
				37	be moved across a volatile load freely, but not an Acquire load.
				38
This document is intended as a guide for anyone either writing a frontend
for LLVM or working on optimization passes for LLVM, explaining how to deal
with instructions with special semantics in the presence of concurrency. This
is not intended to be a precise guide to the semantics; the details can get
extremely complicated and unreadable, and are not usually necessary.
				44
				45	.. _Optimization outside atomic:
				46
				47	Optimization outside atomic
				48	===========================
				49
				50	The basic ``'load'`` and ``'store'`` allow a variety of optimizations, but can
				51	lead to undefined results in a concurrent environment; see `NotAtomic`_. This
				52	section specifically goes into the one optimizer restriction which applies in
				53	concurrent environments, which gets a bit more of an extended description
				54	because any optimization dealing with stores needs to be aware of it.
				55
				56	From the optimizer's point of view, the rule is that if there are not any
				57	instructions with atomic ordering involved, concurrency does not matter, with
				58	one exception: if a variable might be visible to another thread or signal
				59	handler, a store cannot be inserted along a path where it might not execute
				60	otherwise. Take the following example:
				61
.. code-block:: c

 /* C code, for readability; run through clang -O2 -S -emit-llvm to get
    equivalent IR */
 int x;
 void f(int* a) {
   for (int i = 0; i < 100; i++) {
     if (a[i])
       x += 1;
   }
 }
				73
				74	The following is equivalent in non-concurrent situations:
				75
.. code-block:: c

 int x;
 void f(int* a) {
   int xtemp = x;
   for (int i = 0; i < 100; i++) {
     if (a[i])
       xtemp += 1;
   }
   x = xtemp;
 }
				87
				88	However, LLVM is not allowed to transform the former to the latter: it could
				89	indirectly introduce undefined behavior if another thread can access ``x`` at
				90	the same time. (This example is particularly of interest because before the
				91	concurrency model was implemented, LLVM would perform this transformation.)
				92
				93	Note that speculative loads are allowed; a load which is part of a race returns
				94	``undef``, but does not have undefined behavior.
				95
				96	Atomic instructions
				97	===================
				98
				99	For cases where simple loads and stores are not sufficient, LLVM provides
				100	various atomic instructions. The exact guarantees provided depend on the
				101	ordering; see `Atomic orderings`_.
				102
				103	``load atomic`` and ``store atomic`` provide the same basic functionality as
				104	non-atomic loads and stores, but provide additional guarantees in situations
				105	where threads and signals are involved.
				106
				107	``cmpxchg`` and ``atomicrmw`` are essentially like an atomic load followed by an
				108	atomic store (where the store is conditional for ``cmpxchg``), but no other
Tim Northover	420a216	2014-06-13 14:24:07 +0000	[diff] [blame]	109	memory operation can happen on any thread between the load and store.
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	110
A ``fence`` provides Acquire and/or Release ordering which is not part of
another operation; it is normally used along with Monotonic memory operations.
A Monotonic load followed by an Acquire fence is roughly equivalent to an
Acquire load, and a Monotonic store following a Release fence is roughly
equivalent to a Release store. SequentiallyConsistent fences behave as both
an Acquire and a Release fence, and offer some additional complicated
guarantees; see the C++11 standard for details.
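
As a rough illustration of the fence equivalence described above, the following
C11 sketch pairs relaxed (Monotonic) accesses with explicit fences. The names
``payload``, ``ready``, ``publish`` and ``try_consume`` are hypothetical and
exist only for this example; the assumption is that a frontend such as clang
maps ``atomic_thread_fence`` to an IR ``fence`` and the relaxed accesses to
``monotonic`` atomic operations.

.. code-block:: c

  #include <stdatomic.h>
  #include <stdbool.h>

  static int payload;        /* plain (NotAtomic) data */
  static atomic_bool ready;  /* flag accessed only with relaxed operations */

  /* Writer: a Release fence followed by a Monotonic store, which is roughly
     equivalent to a Release store of the flag. */
  void publish(int value) {
    payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
  }

  /* Reader: a Monotonic load followed by an Acquire fence, which is roughly
     equivalent to an Acquire load of the flag. Returns true and fills *out
     once the payload has been published. */
  bool try_consume(int *out) {
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
      return false;
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
  }

A reader that observes ``ready`` as true is then guaranteed to see the value
written to ``payload`` before the Release fence.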
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	118
				119	Frontends generating atomic instructions generally need to be aware of the
				120	target to some degree; atomic instructions are guaranteed to be lock-free, and
				121	therefore an instruction which is wider than the target natively supports can be
				122	impossible to generate.
				123
				124	.. _Atomic orderings:
				125
				126	Atomic orderings
				127	================
				128
				129	In order to achieve a balance between performance and necessary guarantees,
				130	there are six levels of atomicity. They are listed in order of strength; each
				131	level includes all the guarantees of the previous level except for
				132	Acquire/Release. (See also `LangRef Ordering <LangRef.html#ordering>`_.)
				133
				134	.. _NotAtomic:
				135
				136	NotAtomic
				137	---------
				138
NotAtomic is the obvious case: a load or store which is not atomic. (This isn't
really a level of atomicity, but is listed here for comparison.) This is
essentially a regular load or store. If there is a race on a given memory
location, loads from that location return ``undef``.
				143
				144	Relevant standard
				145	This is intended to match shared variables in C/C++, and to be used in any
				146	other context where memory access is necessary, and a race is impossible. (The
				147	precise definition is in `LangRef Memory Model <LangRef.html#memmodel>`_.)
				148
				149	Notes for frontends
				150	The rule is essentially that all memory accessed with basic loads and stores
				151	by multiple threads should be protected by a lock or other synchronization;
				152	otherwise, you are likely to run into undefined behavior. If your frontend is
				153	for a "safe" language like Java, use Unordered to load and store any shared
				154	variable. Note that NotAtomic volatile loads and stores are not properly
				155	atomic; do not try to use them as a substitute. (Per the C/C++ standards,
				156	volatile does provide some limited guarantees around asynchronous signals, but
				157	atomics are generally a better solution.)
				158
				159	Notes for optimizers
				160	Introducing loads to shared variables along a codepath where they would not
				161	otherwise exist is allowed; introducing stores to shared variables is not. See
				162	`Optimization outside atomic`_.
				163
				164	Notes for code generation
				165	The one interesting restriction here is that it is not allowed to write to
				166	bytes outside of the bytes relevant to a store. This is mostly relevant to
				167	unaligned stores: it is not allowed in general to convert an unaligned store
				168	into two aligned stores of the same width as the unaligned store. Backends are
				169	also expected to generate an i8 store as an i8 store, and not an instruction
				170	which writes to surrounding bytes. (If you are writing a backend for an
				171	architecture which cannot satisfy these restrictions and cares about
Tanya Lattner	0d28f80	2015-08-05 03:51:17 +0000	[diff] [blame]	172	concurrency, please send an email to llvm-dev.)
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	173
				174	Unordered
				175	---------
				176
				177	Unordered is the lowest level of atomicity. It essentially guarantees that races
				178	produce somewhat sane results instead of having undefined behavior. It also
Jingyue Wu	c4725da	2014-09-23 17:35:28 +0000	[diff] [blame]	179	guarantees the operation to be lock-free, so it does not depend on the data
				180	being part of a special atomic structure or depend on a separate per-process
				181	global lock. Note that code generation will fail for unsupported atomic
				182	operations; if you need such an operation, use explicit locking.
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	183
				184	Relevant standard
				185	This is intended to match the Java memory model for shared variables.
				186
Notes for frontends
  This cannot be used for synchronization, but is useful for Java and other
  "safe" languages which need to guarantee that the generated code never
  exhibits undefined behavior. Note that this guarantee is cheap on common
  platforms for loads and stores of a native width, but can be expensive or
  unavailable for wider accesses, like a 64-bit store on ARM. (A frontend for
  Java or other "safe" languages would normally split a 64-bit store on ARM
  into two 32-bit unordered stores.)
				195
				196	Notes for optimizers
				197	In terms of the optimizer, this prohibits any transformation that transforms a
				198	single load into multiple loads, transforms a store into multiple stores,
				199	narrows a store, or stores a value which would not be stored otherwise. Some
				200	examples of unsafe optimizations are narrowing an assignment into a bitfield,
				201	rematerializing a load, and turning loads and stores into a memcpy
				202	call. Reordering unordered operations is safe, though, and optimizers should
				203	take advantage of that because unordered operations are common in languages
				204	that need them.
				205
				206	Notes for code generation
				207	These operations are required to be atomic in the sense that if you use
				208	unordered loads and unordered stores, a load cannot see a value which was
				209	never stored. A normal load or store instruction is usually sufficient, but
				210	note that an unordered load or store cannot be split into multiple
				211	instructions (or an instruction which does multiple memory operations, like
JF Bastien	e84854a	2013-06-18 23:07:16 +0000	[diff] [blame]	212	``LDRD`` on ARM without LPAE, or not naturally-aligned ``LDRD`` on LPAE ARM).
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	213
				214	Monotonic
				215	---------
				216
				217	Monotonic is the weakest level of atomicity that can be used in synchronization
				218	primitives, although it does not provide any general synchronization. It
				219	essentially guarantees that if you take all the operations affecting a specific
				220	address, a consistent ordering exists.
				221
				222	Relevant standard
Robin Morisset	e83f59e	2014-10-03 01:04:20 +0000	[diff] [blame]	223	This corresponds to the C++11/C11 ``memory_order_relaxed``; see those
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	224	standards for the exact definition.
				225
				226	Notes for frontends
				227	If you are writing a frontend which uses this directly, use with caution. The
				228	guarantees in terms of synchronization are very weak, so make sure these are
				229	only used in a pattern which you know is correct. Generally, these would
				230	either be used for atomic operations which do not protect other memory (like
				231	an atomic counter), or along with a ``fence``.
				232
				233	Notes for optimizers
				234	In terms of the optimizer, this can be treated as a read+write on the relevant
				235	memory location (and alias analysis will take advantage of that). In addition,
				236	it is legal to reorder non-atomic and Unordered loads around Monotonic
				237	loads. CSE/DSE and a few other optimizations are allowed, but Monotonic
				238	operations are unlikely to be used in ways which would make those
				239	optimizations useful.
				240
				241	Notes for code generation
				242	Code generation is essentially the same as that for unordered for loads and
				243	stores. No fences are required. ``cmpxchg`` and ``atomicrmw`` are required
				244	to appear as a single operation.
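
The atomic counter mentioned in the notes for frontends can be sketched in C11
as follows. This is only an illustration: ``hits``, ``record_hit`` and
``read_hits`` are hypothetical names, and the sketch assumes the frontend maps
``memory_order_relaxed`` to Monotonic as described under "Relevant standard".

.. code-block:: c

  #include <stdatomic.h>

  /* A statistics counter that does not protect any other memory. */
  static atomic_ulong hits;

  /* Atomically increments the counter and returns the previous value. */
  unsigned long record_hit(void) {
    return atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
  }

  /* Reads the current count; no ordering with other memory is implied. */
  unsigned long read_hits(void) {
    return atomic_load_explicit(&hits, memory_order_relaxed);
  }

Increments are never lost, but a reader must not infer anything about other
memory locations from the value it observes.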
				245
				246	Acquire
				247	-------
				248
				249	Acquire provides a barrier of the sort necessary to acquire a lock to access
				250	other memory with normal loads and stores.
				251
				252	Relevant standard
Robin Morisset	e83f59e	2014-10-03 01:04:20 +0000	[diff] [blame]	253	This corresponds to the C++11/C11 ``memory_order_acquire``. It should also be
				254	used for C++11/C11 ``memory_order_consume``.
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	255
				256	Notes for frontends
				257	If you are writing a frontend which uses this directly, use with caution.
				258	Acquire only provides a semantic guarantee when paired with a Release
				259	operation.
				260
				261	Notes for optimizers
				262	Optimizers not aware of atomics can treat this like a nothrow call. It is
				263	also possible to move stores from before an Acquire load or read-modify-write
				264	operation to after it, and move non-Acquire loads from before an Acquire
				265	operation to after it.
				266
				267	Notes for code generation
				268	Architectures with weak memory ordering (essentially everything relevant today
				269	except x86 and SPARC) require some sort of fence to maintain the Acquire
				270	semantics. The precise fences required varies widely by architecture, but for
				271	a simple implementation, most architectures provide a barrier which is strong
				272	enough for everything (``dmb`` on ARM, ``sync`` on PowerPC, etc.). Putting
				273	such a fence after the equivalent Monotonic operation is sufficient to
				274	maintain Acquire semantics for a memory operation.
				275
				276	Release
				277	-------
				278
				279	Release is similar to Acquire, but with a barrier of the sort necessary to
				280	release a lock.
				281
				282	Relevant standard
Robin Morisset	e83f59e	2014-10-03 01:04:20 +0000	[diff] [blame]	283	This corresponds to the C++11/C11 ``memory_order_release``.
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	284
Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Release only provides a semantic guarantee when paired with an Acquire
  operation.
				289
Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. It is
  also possible to move loads from after a Release store or read-modify-write
  operation to before it, and move non-Release stores from after a Release
  operation to before it.
				295
				296	Notes for code generation
				297	See the section on Acquire; a fence before the relevant operation is usually
				298	sufficient for Release. Note that a store-store fence is not sufficient to
				299	implement Release semantics; store-store fences are generally not exposed to
				300	IR because they are extremely difficult to use correctly.
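
The lock analogy used for Acquire and Release can be made concrete with a
minimal spinlock sketch in C11. The type ``sketch_spinlock`` and the functions
below are hypothetical names for this example, not part of LLVM or any library;
a production lock would also need backoff and fairness considerations.

.. code-block:: c

  #include <stdatomic.h>
  #include <stdbool.h>

  typedef struct { atomic_bool locked; } sketch_spinlock;

  /* Acquire: later accesses to the protected data cannot move before this. */
  bool spin_trylock(sketch_spinlock *l) {
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
  }

  void spin_lock(sketch_spinlock *l) {
    while (!spin_trylock(l)) {
      /* busy-wait until the current holder releases the lock */
    }
  }

  /* Release: earlier accesses to the protected data cannot move after this. */
  void spin_unlock(sketch_spinlock *l) {
    atomic_store_explicit(&l->locked, false, memory_order_release);
  }

  static sketch_spinlock counter_lock;
  static int counter;  /* protected by counter_lock, accessed non-atomically */

  int locked_increment(void) {
    spin_lock(&counter_lock);
    int v = ++counter;
    spin_unlock(&counter_lock);
    return v;
  }

The Release store in ``spin_unlock`` pairs with the Acquire exchange in the next
successful ``spin_trylock``, which is what makes the plain accesses to
``counter`` race-free.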
				301
				302	AcquireRelease
				303	--------------
				304
				305	AcquireRelease (``acq_rel`` in IR) provides both an Acquire and a Release
				306	barrier (for fences and operations which both read and write memory).
				307
				308	Relevant standard
Robin Morisset	e83f59e	2014-10-03 01:04:20 +0000	[diff] [blame]	309	This corresponds to the C++11/C11 ``memory_order_acq_rel``.
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	310
				311	Notes for frontends
				312	If you are writing a frontend which uses this directly, use with caution.
				313	Acquire only provides a semantic guarantee when paired with a Release
				314	operation, and vice versa.
				315
				316	Notes for optimizers
Sylvestre Ledru	35521e2	2012-07-23 08:51:15 +0000	[diff] [blame]	317	In general, optimizers should treat this like a nothrow call; the possible
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	318	optimizations are usually not interesting.
				319
				320	Notes for code generation
				321	This operation has Acquire and Release semantics; see the sections on Acquire
				322	and Release.
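
One common use of an operation with both Acquire and Release semantics is
dropping a reference count. The following C11 sketch is an illustration under
that assumption; ``shared_obj`` and its functions are hypothetical names.

.. code-block:: c

  #include <stdatomic.h>
  #include <stdbool.h>

  typedef struct {
    atomic_uint refs;
    int data;
  } shared_obj;

  void obj_init(shared_obj *o, int data) {
    o->data = data;
    atomic_init(&o->refs, 1);
  }

  /* Taking a reference does not need to order other memory. */
  void obj_retain(shared_obj *o) {
    atomic_fetch_add_explicit(&o->refs, 1, memory_order_relaxed);
  }

  /* Release publishes this thread's writes to the object; Acquire makes the
     writes of all other releasing threads visible to whoever drops the last
     reference. Returns true if the caller may now destroy the object. */
  bool obj_release(shared_obj *o) {
    return atomic_fetch_sub_explicit(&o->refs, 1, memory_order_acq_rel) == 1;
  }
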
				323
				324	SequentiallyConsistent
				325	----------------------
				326
				327	SequentiallyConsistent (``seq_cst`` in IR) provides Acquire semantics for loads
				328	and Release semantics for stores. Additionally, it guarantees that a total
				329	ordering exists between all SequentiallyConsistent operations.
				330
				331	Relevant standard
Robin Morisset	e83f59e	2014-10-03 01:04:20 +0000	[diff] [blame]	332	This corresponds to the C++11/C11 ``memory_order_seq_cst``, Java volatile, and
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	333	the gcc-compatible ``__sync_*`` builtins which do not specify otherwise.
				334
				335	Notes for frontends
				336	If a frontend is exposing atomic operations, these are much easier to reason
				337	about for the programmer than other kinds of operations, and using them is
				338	generally a practical performance tradeoff.
				339
				340	Notes for optimizers
				341	Optimizers not aware of atomics can treat this like a nothrow call. For
				342	SequentiallyConsistent loads and stores, the same reorderings are allowed as
				343	for Acquire loads and Release stores, except that SequentiallyConsistent
				344	operations may not be reordered.
				345
				346	Notes for code generation
				347	SequentiallyConsistent loads minimally require the same barriers as Acquire
				348	operations and SequentiallyConsistent stores require Release
				349	barriers. Additionally, the code generator must enforce ordering between
				350	SequentiallyConsistent stores followed by SequentiallyConsistent loads. This
				351	is usually done by emitting either a full fence before the loads or a full
				352	fence after the stores; which is preferred varies by architecture.
				353
				354	Atomics and IR optimization
				355	===========================
				356
				357	Predicates for optimizer writers to query:
				358
				359	* ``isSimple()``: A load or store which is not volatile or atomic. This is
				360	what, for example, memcpyopt would check for operations it might transform.
				361
				362	* ``isUnordered()``: A load or store which is not volatile and at most
				363	Unordered. This would be checked, for example, by LICM before hoisting an
				364	operation.
				365
* ``mayReadFromMemory()``/``mayWriteToMemory()``: Existing predicates, but note
  that they return true for any operation which is volatile or at least
  Monotonic.

* ``isStrongerThan`` / ``isAtLeastOrStrongerThan``: These are predicates on
  orderings. They can be useful for passes that are aware of atomics, for
  example to do DSE across a single atomic access, but not across a
  release-acquire pair (see MemoryDependencyAnalysis for an example of this).
				374
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	375	* Alias analysis: Note that AA will return ModRef for anything Acquire or
				376	Release, and for the address accessed by any Monotonic operation.
				377
				378	To support optimizing around atomic operations, make sure you are using the
				379	right predicates; everything should work if that is done. If your pass should
				380	optimize some atomic operations (Unordered operations in particular), make sure
				381	it doesn't replace an atomic load or store with a non-atomic operation.
				382
				383	Some examples of how optimizations interact with various kinds of atomic
				384	operations:
				385
				386	* ``memcpyopt``: An atomic operation cannot be optimized into part of a
				387	memcpy/memset, including unordered loads/stores. It can pull operations
				388	across some atomic operations.
				389
				390	* LICM: Unordered loads/stores can be moved out of a loop. It just treats
				391	monotonic operations like a read+write to a memory location, and anything
				392	stricter than that like a nothrow call.
				393
* DSE: Unordered stores can be DSE'ed like normal stores. Monotonic stores can
  be DSE'ed in some cases, but it's tricky to reason about, and not especially
  important. It is possible in some cases for DSE to operate across a stronger
  atomic operation, but it is fairly tricky. DSE delegates this reasoning to
  MemoryDependencyAnalysis (which is also used by other passes like GVN).
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	399
				400	* Folding a load: Any atomic load from a constant global can be constant-folded,
David Majnemer	cbf614a	2016-06-15 00:19:09 +0000	[diff] [blame]	401	because it cannot be observed. Similar reasoning allows sroa with
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	402	atomic loads and stores.
				403
				404	Atomics and Codegen
				405	===================
				406
				407	Atomic operations are represented in the SelectionDAG with ``ATOMIC_*`` opcodes.
				408	On architectures which use barrier instructions for all atomic ordering (like
Robin Morisset	e83f59e	2014-10-03 01:04:20 +0000	[diff] [blame]	409	ARM), appropriate fences can be emitted by the AtomicExpand Codegen pass if
				410	``setInsertFencesForAtomic()`` was used.
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	411
				412	The MachineMemOperand for all atomic operations is currently marked as volatile;
				413	this is not correct in the IR sense of volatile, but CodeGen handles anything
				414	marked volatile very conservatively. This should get fixed at some point.
				415
James Y Knight	19f6cce	2016-04-12 20:18:48 +0000	[diff] [blame]	416	One very important property of the atomic operations is that if your backend
				417	supports any inline lock-free atomic operations of a given size, you should
				418	support ALL operations of that size in a lock-free manner.
				419
				420	When the target implements atomic ``cmpxchg`` or LL/SC instructions (as most do)
				421	this is trivial: all the other operations can be implemented on top of those
				422	primitives. However, on many older CPUs (e.g. ARMv5, SparcV8, Intel 80386) there
				423	are atomic load and store instructions, but no ``cmpxchg`` or LL/SC. As it is
				424	invalid to implement ``atomic load`` using the native instruction, but
				425	``cmpxchg`` using a library call to a function that uses a mutex, ``atomic
				426	load`` must also expand to a library call on such architectures, so that it
				427	can remain atomic with regards to a simultaneous ``cmpxchg``, by using the same
				428	mutex.
				429
				430	AtomicExpandPass can help with that: it will expand all atomic operations to the
				431	proper ``__atomic_*`` libcalls for any size above the maximum set by
				432	``setMaxAtomicSizeInBitsSupported`` (which defaults to 0).
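
The requirement that a load and a ``cmpxchg`` of the same size agree on one
mechanism can be sketched in C. Everything here is hypothetical:
``locked_load_u32`` and ``locked_cmpxchg_u32`` stand in for library calls, and
the ``atomic_flag`` spinlock stands in for whatever mutex a real runtime
library would use on a target without ``cmpxchg`` or LL/SC.

.. code-block:: c

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdint.h>

  /* One lock shared by every operation on the lock-based path. */
  static atomic_flag big_lock = ATOMIC_FLAG_INIT;

  static void take_big_lock(void) {
    while (atomic_flag_test_and_set_explicit(&big_lock, memory_order_acquire)) {
      /* spin */
    }
  }

  static void drop_big_lock(void) {
    atomic_flag_clear_explicit(&big_lock, memory_order_release);
  }

  /* The load takes the same lock as the compare-exchange, so it can never
     observe a half-finished compare-exchange. */
  uint32_t locked_load_u32(const uint32_t *p) {
    take_big_lock();
    uint32_t v = *p;
    drop_big_lock();
    return v;
  }

  /* On failure, *expected is updated with the value actually found. */
  bool locked_cmpxchg_u32(uint32_t *p, uint32_t *expected, uint32_t desired) {
    take_big_lock();
    bool ok = (*p == *expected);
    if (ok)
      *p = desired;
    else
      *expected = *p;
    drop_big_lock();
    return ok;
  }

If the load bypassed the lock and used a native instruction instead, nothing
would order it against the read-compare-write sequence inside
``locked_cmpxchg_u32``.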
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	433
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	434	On x86, all atomic loads generate a ``MOV``. SequentiallyConsistent stores
				435	generate an ``XCHG``, other stores generate a ``MOV``. SequentiallyConsistent
				436	fences generate an ``MFENCE``, other fences do not cause any code to be
James Y Knight	19f6cce	2016-04-12 20:18:48 +0000	[diff] [blame]	437	generated. ``cmpxchg`` uses the ``LOCK CMPXCHG`` instruction. ``atomicrmw xchg``
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	438	uses ``XCHG``, ``atomicrmw add`` and ``atomicrmw sub`` use ``XADD``, and all
				439	other ``atomicrmw`` operations generate a loop with ``LOCK CMPXCHG``. Depending
				440	on the users of the result, some ``atomicrmw`` operations can be translated into
				441	operations like ``LOCK AND``, but that does not work in general.
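
The compare-exchange loop used for ``atomicrmw`` operations without a dedicated
instruction has roughly the following shape, shown here in C11 for an atomic
maximum. ``fetch_max`` is a hypothetical helper written for this example; the
code a backend actually emits will differ in detail.

.. code-block:: c

  #include <stdatomic.h>

  /* Atomically replaces *p with max(*p, v) and returns the previous value. */
  int fetch_max(atomic_int *p, int v) {
    int old = atomic_load_explicit(p, memory_order_relaxed);
    int desired;
    do {
      desired = old > v ? old : v;
      /* On failure, old is refreshed with the current value and we retry. */
    } while (!atomic_compare_exchange_weak_explicit(p, &old, desired,
                                                    memory_order_seq_cst,
                                                    memory_order_relaxed));
    return old;
  }
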
				442
Tim Northover	420a216	2014-06-13 14:24:07 +0000	[diff] [blame]	443	On ARM (before v8), MIPS, and many other RISC architectures, Acquire, Release,
				444	and SequentiallyConsistent semantics require barrier instructions for every such
Bill Wendling	2908947	2012-06-29 09:00:01 +0000	[diff] [blame]	445	operation. Loads and stores generate normal instructions. ``cmpxchg`` and
				446	``atomicrmw`` can be represented using a loop with LL/SC-style instructions
				447	which take some sort of exclusive lock on a cache line (``LDREX`` and ``STREX``
Tim Northover	420a216	2014-06-13 14:24:07 +0000	[diff] [blame]	448	on ARM, etc.).
Robin Morisset	e83f59e	2014-10-03 01:04:20 +0000	[diff] [blame]	449
				450	It is often easiest for backends to use AtomicExpandPass to lower some of the
				451	atomic constructs. Here are some lowerings it can do:
Dan Liew	460e0f4	2014-10-03 12:28:48 +0000	[diff] [blame]	452
* cmpxchg -> loop with load-linked/store-conditional
  by overriding ``shouldExpandAtomicCmpXchgInIR()``, ``emitLoadLinked()``,
  ``emitStoreConditional()``
* large loads/stores -> ll-sc/cmpxchg
  by overriding ``shouldExpandAtomicStoreInIR()``/``shouldExpandAtomicLoadInIR()``
* strong atomic accesses -> monotonic accesses + fences by overriding
  ``shouldInsertFencesForAtomic()``, ``emitLeadingFence()``, and
  ``emitTrailingFence()``
* atomic rmw -> loop with cmpxchg or load-linked/store-conditional
  by overriding ``expandAtomicRMWInIR()``
* expansion to __atomic_* libcalls for unsupported sizes.
Dan Liew	460e0f4	2014-10-03 12:28:48 +0000	[diff] [blame]	464
Robin Morisset	e83f59e	2014-10-03 01:04:20 +0000	[diff] [blame]	465	For an example of all of these, look at the ARM backend.
James Y Knight	19f6cce	2016-04-12 20:18:48 +0000	[diff] [blame]	466
				467	Libcalls: __atomic_*
				468	====================
				469
There are two kinds of atomic library calls that are generated by LLVM. Please
note that both sets of library functions somewhat confusingly share the names of
builtin functions defined by clang. Despite this, the library functions are
not directly related to the builtins: it is not the case that ``__atomic_*``
builtins lower to ``__atomic_*`` library calls and ``__sync_*`` builtins lower
to ``__sync_*`` library calls.
				476
				477	The first set of library functions are named ``__atomic_*``. This set has been
				478	"standardized" by GCC, and is described below. (See also `GCC's documentation
				479	<https://gcc.gnu.org/wiki/Atomic/GCCMM/LIbrary>`_)
				480
				481	LLVM's AtomicExpandPass will translate atomic operations on data sizes above
				482	``MaxAtomicSizeInBitsSupported`` into calls to these functions.
				483
				484	There are four generic functions, which can be called with data of any size or
				485	alignment::
				486
  void __atomic_load(size_t size, void *ptr, void *ret, int ordering)
  void __atomic_store(size_t size, void *ptr, void *val, int ordering)
  void __atomic_exchange(size_t size, void *ptr, void *val, void *ret, int ordering)
  bool __atomic_compare_exchange(size_t size, void *ptr, void *expected, void *desired, int success_order, int failure_order)
				491
				492	There are also size-specialized versions of the above functions, which can only
				493	be used with naturally-aligned pointers of the appropriate size. In the
				494	signatures below, "N" is one of 1, 2, 4, 8, and 16, and "iN" is the appropriate
				495	integer type of that size; if no such integer type exists, the specialization
				496	cannot be used::
				497
  iN __atomic_load_N(iN *ptr, int ordering)
  void __atomic_store_N(iN *ptr, iN val, int ordering)
  iN __atomic_exchange_N(iN *ptr, iN val, int ordering)
  bool __atomic_compare_exchange_N(iN *ptr, iN *expected, iN desired, int success_order, int failure_order)
				502
				503	Finally there are some read-modify-write functions, which are only available in
				504	the size-specific variants (any other sizes use a ``__atomic_compare_exchange``
				505	loop)::
				506
  iN __atomic_fetch_add_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_sub_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_and_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_or_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_xor_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_nand_N(iN *ptr, iN val, int ordering)
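
The read-modify-write functions all return the value held before the update.
The sketch below illustrates that convention using the GCC/Clang
``__atomic_fetch_*`` builtins, which have the same names but are not the
library functions themselves. The wrappers ``fetch_add_u32`` and
``fetch_nand_u32`` are hypothetical, and the sketch assumes a target where
32-bit atomics are inline and lock-free.

.. code-block:: c

  #include <stdint.h>

  /* Returns the old value; afterwards *p holds old + n. */
  uint32_t fetch_add_u32(uint32_t *p, uint32_t n) {
    return __atomic_fetch_add(p, n, __ATOMIC_SEQ_CST);
  }

  /* Returns the old value; afterwards *p holds ~(old & mask). */
  uint32_t fetch_nand_u32(uint32_t *p, uint32_t mask) {
    return __atomic_fetch_nand(p, mask, __ATOMIC_SEQ_CST);
  }
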
				513
This set of library functions has some interesting implementation requirements
to take note of:

- They support all sizes and alignments -- including those which cannot be
  implemented natively on any existing hardware. Therefore, they will certainly
  use mutexes for some sizes/alignments.
				520
				521	- As a consequence, they cannot be shipped in a statically linked
				522	compiler-support library, as they have state which must be shared amongst all
				523	DSOs loaded in the program. They must be provided in a shared library used by
				524	all objects.
				525
- The set of atomic sizes supported lock-free must be a superset of the sizes
  any compiler can emit. That is: if a new compiler introduces support for
  inline-lock-free atomics of size N, the ``__atomic_*`` functions must also have a
  lock-free implementation for size N. This is a requirement so that code
  produced by an old compiler (which will have called the ``__atomic_*`` function)
  interoperates with code produced by the new compiler (which will use the
  native atomic instruction).
				533
Note that it's possible to write an entirely target-independent implementation
of these library functions by using the compiler atomic builtins themselves to
implement the operations on naturally-aligned pointers of supported sizes, and a
generic mutex implementation otherwise.

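The target-independent approach described above might look like the sketch
below, shown for the load operation only. Everything in the ``sketch``
namespace is hypothetical and chosen for illustration: the function names, the
64-entry lock table, and the address hash are assumptions, and ``std::mutex``
stands in for whatever lock a real runtime would use. The compiler builtins
``__atomic_always_lock_free`` and ``__atomic_load_n`` are the GCC/Clang ones.

.. code-block:: c++

  #include <cstddef>
  #include <cstdint>
  #include <cstring>
  #include <mutex>

  namespace sketch {

  // Hypothetical lock table, indexed by a hash of the object's address. A
  // real library must keep this state in one shared library used by every
  // DSO in the program.
  inline std::mutex &lock_for(const void *ptr) {
    static std::mutex locks[64];
    auto addr = reinterpret_cast<std::uintptr_t>(ptr);
    return locks[(addr >> 4) % 64];
  }

  // Use the compiler builtin when T is always lock-free and ptr is naturally
  // aligned; otherwise report failure so the caller can take the lock.
  template <typename T> bool try_native_load(void *ptr, void *ret) {
    if (!__atomic_always_lock_free(sizeof(T), 0) ||
        reinterpret_cast<std::uintptr_t>(ptr) % sizeof(T) != 0)
      return false;
    T value = __atomic_load_n(static_cast<T *>(ptr), __ATOMIC_SEQ_CST);
    std::memcpy(ret, &value, sizeof(T));
    return true;
  }

  // Hypothetical stand-in for a generic load routine: copies `size` bytes
  // from `ptr` to `ret`, natively where possible and under a mutex otherwise.
  inline void generic_load(std::size_t size, void *ptr, void *ret) {
    bool done = false;
    switch (size) {
    case 1: done = try_native_load<std::uint8_t>(ptr, ret); break;
    case 2: done = try_native_load<std::uint16_t>(ptr, ret); break;
    case 4: done = try_native_load<std::uint32_t>(ptr, ret); break;
    case 8: done = try_native_load<std::uint64_t>(ptr, ret); break;
    default: break;
    }
    if (done)
      return;
    std::lock_guard<std::mutex> guard(lock_for(ptr));
    std::memcpy(ret, ptr, size);
  }

  } // namespace sketch

The choice between the native path and the mutex path depends only on the size
and alignment of the object, so every access to a given object takes the same
path. Stores and read-modify-write operations would need the same dispatch.
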
Libcalls: __sync_*
==================

Some targets or OS/target combinations can support lock-free atomics, but for
various reasons, it is not practical to emit the instructions inline.

There are two typical examples of this.

Some CPUs support multiple instruction sets which can be switched back and forth
on function-call boundaries. For example, MIPS supports the MIPS16 ISA, which
has a smaller instruction encoding than the usual MIPS32 ISA. ARM, similarly,
has the Thumb ISA. In MIPS16 and earlier versions of Thumb, the atomic
instructions are not encodable. However, those instructions are available via a
function call to a function with the longer encoding.

Additionally, a few OS/target pairs provide kernel-supported lock-free
atomics. ARM/Linux is an example of this: the kernel `provides
<https://www.kernel.org/doc/Documentation/arm/kernel_user_helpers.txt>`_ a
function which on older CPUs contains a "magically-restartable" atomic sequence
(which looks atomic so long as there's only one CPU), and contains actual atomic
instructions on newer multicore models. This sort of functionality can typically
be provided on any architecture, if all CPUs which are missing atomic
compare-and-swap support are uniprocessor (no SMP). This is almost always the
case. The only common architecture without that property is SPARC -- SPARCV8 SMP
systems were common, yet SPARCV8 doesn't support any sort of compare-and-swap
operation.

In either of these cases, the Target in LLVM can claim support for atomics of an
appropriate size, and then implement some subset of the operations via libcalls
to a ``__sync_*`` function. Such functions *must* not use locks in their
implementation, because unlike the ``__atomic_*`` routines used by
AtomicExpandPass, these may be mixed-and-matched with native instructions by the
target lowering.

Further, these routines do not need to be shared, as they are stateless. So,
there is no issue with having multiple copies included in one binary. Thus,
typically these routines are implemented by the statically-linked compiler
runtime support library.

LLVM will emit a call to an appropriate ``__sync_*`` routine if the target
ISelLowering code has set the corresponding ``ATOMIC_CMPXCHG``, ``ATOMIC_SWAP``,
or ``ATOMIC_LOAD_*`` operation to "Expand", and if it has opted into the
availability of those library functions via a call to ``initSyncLibcalls()``.

The full set of functions that may be called by LLVM is (for ``N`` being 1, 2,
4, 8, or 16)::

  iN __sync_val_compare_and_swap_N(iN *ptr, iN expected, iN desired)
  iN __sync_lock_test_and_set_N(iN *ptr, iN val)
  iN __sync_fetch_and_add_N(iN *ptr, iN val)
  iN __sync_fetch_and_sub_N(iN *ptr, iN val)
  iN __sync_fetch_and_and_N(iN *ptr, iN val)
  iN __sync_fetch_and_or_N(iN *ptr, iN val)
  iN __sync_fetch_and_xor_N(iN *ptr, iN val)
  iN __sync_fetch_and_nand_N(iN *ptr, iN val)
  iN __sync_fetch_and_max_N(iN *ptr, iN val)
  iN __sync_fetch_and_umax_N(iN *ptr, iN val)
  iN __sync_fetch_and_min_N(iN *ptr, iN val)
  iN __sync_fetch_and_umin_N(iN *ptr, iN val)

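Because these routines must be lock-free and stateless, one way a runtime
library can provide an operation that has no single native instruction is a
retry loop around compare-and-swap. The sketch below illustrates that idea for
a signed 32-bit maximum. It is not taken from any particular runtime library:
``sketch_sync_fetch_and_max_4`` is a hypothetical name, and the GCC/Clang
builtin ``__sync_val_compare_and_swap`` stands in for whatever primitive the
target or kernel actually provides.

.. code-block:: c++

  #include <cstdint>

  // Hypothetical lock-free fetch-and-max built on compare-and-swap. It keeps
  // no state of its own, so multiple copies in one binary are harmless.
  inline std::int32_t sketch_sync_fetch_and_max_4(std::int32_t *ptr,
                                                  std::int32_t val) {
    std::int32_t old = __atomic_load_n(ptr, __ATOMIC_RELAXED);
    for (;;) {
      std::int32_t desired = old > val ? old : val;
      // Returns the value that was in memory before the operation; the swap
      // took effect only if that value equals `old`.
      std::int32_t seen = __sync_val_compare_and_swap(ptr, old, desired);
      if (seen == old)
        return old;
      old = seen; // Another thread changed the value; retry with fresh data.
    }
  }
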
This list doesn't include any function for atomic load or store; all known
architectures support atomic loads and stores directly (possibly by emitting a
fence on either side of a normal load or store).

Somewhat separately, ``ATOMIC_FENCE`` can also be lowered to
``__sync_synchronize()``. Whether this happens is independent of everything
above and is controlled purely by ``setOperationAction(ISD::ATOMIC_FENCE, ...)``.