docs/Atomics.rst - fp2-dev/platform/external/llvm - Gitiles

 ==============================================
 LLVM Atomic Instructions and Concurrency Guide
 ==============================================

 .. contents::
    :local:

 Introduction
 ============

 Historically, LLVM has not had very strong support for concurrency; some minimal
 intrinsics were provided, and ``volatile`` was used in some cases to achieve
 rough semantics in the presence of concurrency.  However, this is changing;
 there are now new instructions which are well-defined in the presence of threads
 and asynchronous signals, and the model for existing instructions has been
 clarified in the IR.

 The atomic instructions are designed specifically to provide readable IR and
 optimized code generation for the following:

 * The new C++0x ``<atomic>`` header.  (`C++0x draft available here
   <http://www.open-std.org/jtc1/sc22/wg21/>`_.) (`C1x draft available here
   <http://www.open-std.org/jtc1/sc22/wg14/>`_.)

 * Proper semantics for Java-style memory, for both ``volatile`` and regular
   shared variables. (`Java Specification
   <http://java.sun.com/docs/books/jls/third_edition/html/memory.html>`_)

 * gcc-compatible ``__sync_*`` builtins. (`Description
   <http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html>`_)

 * Other scenarios with atomic semantics, including ``static`` variables with
   non-trivial constructors in C++.

 Atomic and volatile in the IR are orthogonal; "volatile" is the C/C++ volatile,
 which ensures that every volatile load and store happens and is performed in the
 stated order.  A couple examples: if a SequentiallyConsistent store is
 immediately followed by another SequentiallyConsistent store to the same
 address, the first store can be erased. This transformation is not allowed for a
 pair of volatile stores. On the other hand, a non-volatile non-atomic load can
 be moved across a volatile load freely, but not an Acquire load.

 This document is intended to provide a guide to anyone either writing a frontend
 for LLVM or working on optimization passes for LLVM with a guide for how to deal
 with instructions with special semantics in the presence of concurrency.  This
 is not intended to be a precise guide to the semantics; the details can get
 extremely complicated and unreadable, and are not usually necessary.

 .. _Optimization outside atomic:

 Optimization outside atomic
 ===========================

 The basic ``'load'`` and ``'store'`` allow a variety of optimizations, but can
 lead to undefined results in a concurrent environment; see `NotAtomic`_. This
 section specifically goes into the one optimizer restriction which applies in
 concurrent environments, which gets a bit more of an extended description
 because any optimization dealing with stores needs to be aware of it.

 From the optimizer's point of view, the rule is that if there are not any
 instructions with atomic ordering involved, concurrency does not matter, with
 one exception: if a variable might be visible to another thread or signal
 handler, a store cannot be inserted along a path where it might not execute
 otherwise.  Take the following example:

 .. code-block:: c

  /* C code, for readability; run through clang -O2 -S -emit-llvm to get
      equivalent IR */
   int x;
   void f(int* a) {
     for (int i = 0; i < 100; i++) {
       if (a[i])
         x += 1;
     }
   }

 The following is equivalent in non-concurrent situations:

 .. code-block:: c

   int x;
   void f(int* a) {
     int xtemp = x;
     for (int i = 0; i < 100; i++) {
       if (a[i])
         xtemp += 1;
     }
     x = xtemp;
   }

 However, LLVM is not allowed to transform the former to the latter: it could
 indirectly introduce undefined behavior if another thread can access ``x`` at
 the same time. (This example is particularly of interest because before the
 concurrency model was implemented, LLVM would perform this transformation.)

 Note that speculative loads are allowed; a load which is part of a race returns
 ``undef``, but does not have undefined behavior.

 Atomic instructions
 ===================

 For cases where simple loads and stores are not sufficient, LLVM provides
 various atomic instructions. The exact guarantees provided depend on the
 ordering; see `Atomic orderings`_.

 ``load atomic`` and ``store atomic`` provide the same basic functionality as
 non-atomic loads and stores, but provide additional guarantees in situations
 where threads and signals are involved.

 ``cmpxchg`` and ``atomicrmw`` are essentially like an atomic load followed by an
 atomic store (where the store is conditional for ``cmpxchg``), but no other
 memory operation can happen on any thread between the load and store.  Note that
 LLVM's cmpxchg does not provide quite as many options as the C++0x version.

 A ``fence`` provides Acquire and/or Release ordering which is not part of
 another operation; it is normally used along with Monotonic memory operations.
 A Monotonic load followed by an Acquire fence is roughly equivalent to an
 Acquire load.

 Frontends generating atomic instructions generally need to be aware of the
 target to some degree; atomic instructions are guaranteed to be lock-free, and
 therefore an instruction which is wider than the target natively supports can be
 impossible to generate.

 .. _Atomic orderings:

 Atomic orderings
 ================

 In order to achieve a balance between performance and necessary guarantees,
 there are six levels of atomicity. They are listed in order of strength; each
 level includes all the guarantees of the previous level except for
 Acquire/Release. (See also `LangRef Ordering <LangRef.html#ordering>`_.)

 .. _NotAtomic:

 NotAtomic
 ---------

 NotAtomic is the obvious, a load or store which is not atomic. (This isn't
 really a level of atomicity, but is listed here for comparison.) This is
 essentially a regular load or store. If there is a race on a given memory
 location, loads from that location return undef.

 Relevant standard
   This is intended to match shared variables in C/C++, and to be used in any
   other context where memory access is necessary, and a race is impossible. (The
   precise definition is in `LangRef Memory Model <LangRef.html#memmodel>`_.)

 Notes for frontends
   The rule is essentially that all memory accessed with basic loads and stores
   by multiple threads should be protected by a lock or other synchronization;
   otherwise, you are likely to run into undefined behavior. If your frontend is
   for a "safe" language like Java, use Unordered to load and store any shared
   variable.  Note that NotAtomic volatile loads and stores are not properly
   atomic; do not try to use them as a substitute. (Per the C/C++ standards,
   volatile does provide some limited guarantees around asynchronous signals, but
   atomics are generally a better solution.)

 Notes for optimizers
   Introducing loads to shared variables along a codepath where they would not
   otherwise exist is allowed; introducing stores to shared variables is not. See
   `Optimization outside atomic`_.

 Notes for code generation
   The one interesting restriction here is that it is not allowed to write to
   bytes outside of the bytes relevant to a store.  This is mostly relevant to
   unaligned stores: it is not allowed in general to convert an unaligned store
   into two aligned stores of the same width as the unaligned store. Backends are
   also expected to generate an i8 store as an i8 store, and not an instruction
   which writes to surrounding bytes.  (If you are writing a backend for an
   architecture which cannot satisfy these restrictions and cares about
   concurrency, please send an email to llvmdev.)

 Unordered
 ---------

 Unordered is the lowest level of atomicity. It essentially guarantees that races
 produce somewhat sane results instead of having undefined behavior.  It also
 guarantees the operation to be lock-free, so it do not depend on the data being
 part of a special atomic structure or depend on a separate per-process global
 lock.  Note that code generation will fail for unsupported atomic operations; if
 you need such an operation, use explicit locking.

 Relevant standard
   This is intended to match the Java memory model for shared variables.

 Notes for frontends
   This cannot be used for synchronization, but is useful for Java and other
   "safe" languages which need to guarantee that the generated code never
   exhibits undefined behavior. Note that this guarantee is cheap on common
   platforms for loads of a native width, but can be expensive or unavailable for
   wider loads, like a 64-bit store on ARM. (A frontend for Java or other "safe"
   languages would normally split a 64-bit store on ARM into two 32-bit unordered
   stores.)

 Notes for optimizers
   In terms of the optimizer, this prohibits any transformation that transforms a
   single load into multiple loads, transforms a store into multiple stores,
   narrows a store, or stores a value which would not be stored otherwise.  Some
   examples of unsafe optimizations are narrowing an assignment into a bitfield,
   rematerializing a load, and turning loads and stores into a memcpy
   call. Reordering unordered operations is safe, though, and optimizers should
   take advantage of that because unordered operations are common in languages
   that need them.

 Notes for code generation
   These operations are required to be atomic in the sense that if you use
   unordered loads and unordered stores, a load cannot see a value which was
   never stored.  A normal load or store instruction is usually sufficient, but
   note that an unordered load or store cannot be split into multiple
   instructions (or an instruction which does multiple memory operations, like
   ``LDRD`` on ARM without LPAE, or not naturally-aligned ``LDRD`` on LPAE ARM).

 Monotonic
 ---------

 Monotonic is the weakest level of atomicity that can be used in synchronization
 primitives, although it does not provide any general synchronization. It
 essentially guarantees that if you take all the operations affecting a specific
 address, a consistent ordering exists.

 Relevant standard
   This corresponds to the C++0x/C1x ``memory_order_relaxed``; see those
   standards for the exact definition.

 Notes for frontends
   If you are writing a frontend which uses this directly, use with caution.  The
   guarantees in terms of synchronization are very weak, so make sure these are
   only used in a pattern which you know is correct.  Generally, these would
   either be used for atomic operations which do not protect other memory (like
   an atomic counter), or along with a ``fence``.

 Notes for optimizers
   In terms of the optimizer, this can be treated as a read+write on the relevant
   memory location (and alias analysis will take advantage of that). In addition,
   it is legal to reorder non-atomic and Unordered loads around Monotonic
   loads. CSE/DSE and a few other optimizations are allowed, but Monotonic
   operations are unlikely to be used in ways which would make those
   optimizations useful.

 Notes for code generation
   Code generation is essentially the same as that for unordered for loads and
   stores.  No fences are required.  ``cmpxchg`` and ``atomicrmw`` are required
   to appear as a single operation.

 Acquire
 -------

 Acquire provides a barrier of the sort necessary to acquire a lock to access
 other memory with normal loads and stores.

 Relevant standard
   This corresponds to the C++0x/C1x ``memory_order_acquire``. It should also be
   used for C++0x/C1x ``memory_order_consume``.

 Notes for frontends
   If you are writing a frontend which uses this directly, use with caution.
   Acquire only provides a semantic guarantee when paired with a Release
   operation.

 Notes for optimizers
   Optimizers not aware of atomics can treat this like a nothrow call.  It is
   also possible to move stores from before an Acquire load or read-modify-write
   operation to after it, and move non-Acquire loads from before an Acquire
   operation to after it.

 Notes for code generation
   Architectures with weak memory ordering (essentially everything relevant today
   except x86 and SPARC) require some sort of fence to maintain the Acquire
   semantics.  The precise fences required varies widely by architecture, but for
   a simple implementation, most architectures provide a barrier which is strong
   enough for everything (``dmb`` on ARM, ``sync`` on PowerPC, etc.).  Putting
   such a fence after the equivalent Monotonic operation is sufficient to
   maintain Acquire semantics for a memory operation.

 Release
 -------

 Release is similar to Acquire, but with a barrier of the sort necessary to
 release a lock.

 Relevant standard
   This corresponds to the C++0x/C1x ``memory_order_release``.

 Notes for frontends
   If you are writing a frontend which uses this directly, use with caution.
   Release only provides a semantic guarantee when paired with a Acquire
   operation.

 Notes for optimizers
   Optimizers not aware of atomics can treat this like a nothrow call.  It is
   also possible to move loads from after a Release store or read-modify-write
   operation to before it, and move non-Release stores from after an Release
   operation to before it.

 Notes for code generation
   See the section on Acquire; a fence before the relevant operation is usually
   sufficient for Release. Note that a store-store fence is not sufficient to
   implement Release semantics; store-store fences are generally not exposed to
   IR because they are extremely difficult to use correctly.

 AcquireRelease
 --------------

 AcquireRelease (``acq_rel`` in IR) provides both an Acquire and a Release
 barrier (for fences and operations which both read and write memory).

 Relevant standard
   This corresponds to the C++0x/C1x ``memory_order_acq_rel``.

 Notes for frontends
   If you are writing a frontend which uses this directly, use with caution.
   Acquire only provides a semantic guarantee when paired with a Release
   operation, and vice versa.

 Notes for optimizers
   In general, optimizers should treat this like a nothrow call; the possible
   optimizations are usually not interesting.

 Notes for code generation
   This operation has Acquire and Release semantics; see the sections on Acquire
   and Release.

 SequentiallyConsistent
 ----------------------

 SequentiallyConsistent (``seq_cst`` in IR) provides Acquire semantics for loads
 and Release semantics for stores. Additionally, it guarantees that a total
 ordering exists between all SequentiallyConsistent operations.

 Relevant standard
   This corresponds to the C++0x/C1x ``memory_order_seq_cst``, Java volatile, and
   the gcc-compatible ``__sync_*`` builtins which do not specify otherwise.

 Notes for frontends
   If a frontend is exposing atomic operations, these are much easier to reason
   about for the programmer than other kinds of operations, and using them is
   generally a practical performance tradeoff.

 Notes for optimizers
   Optimizers not aware of atomics can treat this like a nothrow call.  For
   SequentiallyConsistent loads and stores, the same reorderings are allowed as
   for Acquire loads and Release stores, except that SequentiallyConsistent
   operations may not be reordered.

 Notes for code generation
   SequentiallyConsistent loads minimally require the same barriers as Acquire
   operations and SequentiallyConsistent stores require Release
   barriers. Additionally, the code generator must enforce ordering between
   SequentiallyConsistent stores followed by SequentiallyConsistent loads. This
   is usually done by emitting either a full fence before the loads or a full
   fence after the stores; which is preferred varies by architecture.

 Atomics and IR optimization
 ===========================

 Predicates for optimizer writers to query:

 * ``isSimple()``: A load or store which is not volatile or atomic.  This is
   what, for example, memcpyopt would check for operations it might transform.

 * ``isUnordered()``: A load or store which is not volatile and at most
   Unordered. This would be checked, for example, by LICM before hoisting an
   operation.

 * ``mayReadFromMemory()``/``mayWriteToMemory()``: Existing predicate, but note
   that they return true for any operation which is volatile or at least
   Monotonic.

 * Alias analysis: Note that AA will return ModRef for anything Acquire or
   Release, and for the address accessed by any Monotonic operation.

 To support optimizing around atomic operations, make sure you are using the
 right predicates; everything should work if that is done.  If your pass should
 optimize some atomic operations (Unordered operations in particular), make sure
 it doesn't replace an atomic load or store with a non-atomic operation.

 Some examples of how optimizations interact with various kinds of atomic
 operations:

 * ``memcpyopt``: An atomic operation cannot be optimized into part of a
   memcpy/memset, including unordered loads/stores.  It can pull operations
   across some atomic operations.

 * LICM: Unordered loads/stores can be moved out of a loop.  It just treats
   monotonic operations like a read+write to a memory location, and anything
   stricter than that like a nothrow call.

 * DSE: Unordered stores can be DSE'ed like normal stores.  Monotonic stores can
   be DSE'ed in some cases, but it's tricky to reason about, and not especially
   important.

 * Folding a load: Any atomic load from a constant global can be constant-folded,
   because it cannot be observed.  Similar reasoning allows scalarrepl with
   atomic loads and stores.

 Atomics and Codegen
 ===================

 Atomic operations are represented in the SelectionDAG with ``ATOMIC_*`` opcodes.
 On architectures which use barrier instructions for all atomic ordering (like
 ARM), appropriate fences are split out as the DAG is built.

 The MachineMemOperand for all atomic operations is currently marked as volatile;
 this is not correct in the IR sense of volatile, but CodeGen handles anything
 marked volatile very conservatively.  This should get fixed at some point.

 Common architectures have some way of representing at least a pointer-sized
 lock-free ``cmpxchg``; such an operation can be used to implement all the other
 atomic operations which can be represented in IR up to that size.  Backends are
 expected to implement all those operations, but not operations which cannot be
 implemented in a lock-free manner.  It is expected that backends will give an
 error when given an operation which cannot be implemented.  (The LLVM code
 generator is not very helpful here at the moment, but hopefully that will
 change.)

 The implementation of atomics on LL/SC architectures (like ARM) is currently a
 bit of a mess; there is a lot of copy-pasted code across targets, and the
 representation is relatively unsuited to optimization (it would be nice to be
 able to optimize loops involving cmpxchg etc.).

 On x86, all atomic loads generate a ``MOV``. SequentiallyConsistent stores
 generate an ``XCHG``, other stores generate a ``MOV``. SequentiallyConsistent
 fences generate an ``MFENCE``, other fences do not cause any code to be
 generated.  cmpxchg uses the ``LOCK CMPXCHG`` instruction.  ``atomicrmw xchg``
 uses ``XCHG``, ``atomicrmw add`` and ``atomicrmw sub`` use ``XADD``, and all
 other ``atomicrmw`` operations generate a loop with ``LOCK CMPXCHG``.  Depending
 on the users of the result, some ``atomicrmw`` operations can be translated into
 operations like ``LOCK AND``, but that does not work in general.

 On ARM, MIPS, and many other RISC architectures, Acquire, Release, and
 SequentiallyConsistent semantics require barrier instructions for every such
 operation. Loads and stores generate normal instructions.  ``cmpxchg`` and
 ``atomicrmw`` can be represented using a loop with LL/SC-style instructions
 which take some sort of exclusive lock on a cache line (``LDREX`` and ``STREX``
 on ARM, etc.). At the moment, the IR does not provide any way to represent a
 weak ``cmpxchg`` which would not require a loop.
	==============================================
	LLVM Atomic Instructions and Concurrency Guide
	==============================================

	.. contents::
	:local:

	Introduction
	============

	Historically, LLVM has not had very strong support for concurrency; some minimal
	intrinsics were provided, and ``volatile`` was used in some cases to achieve
	rough semantics in the presence of concurrency. However, this is changing;
	there are now new instructions which are well-defined in the presence of threads
	and asynchronous signals, and the model for existing instructions has been
	clarified in the IR.

	The atomic instructions are designed specifically to provide readable IR and
	optimized code generation for the following:

	* The new C++0x ``<atomic>`` header. (`C++0x draft available here
	<http://www.open-std.org/jtc1/sc22/wg21/>`_.) (`C1x draft available here
	<http://www.open-std.org/jtc1/sc22/wg14/>`_.)

	* Proper semantics for Java-style memory, for both ``volatile`` and regular
	shared variables. (`Java Specification
	<http://java.sun.com/docs/books/jls/third_edition/html/memory.html>`_)

	* gcc-compatible ``__sync_*`` builtins. (`Description
	<http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html>`_)

	* Other scenarios with atomic semantics, including ``static`` variables with
	non-trivial constructors in C++.

	Atomic and volatile in the IR are orthogonal; "volatile" is the C/C++ volatile,
	which ensures that every volatile load and store happens and is performed in the
	stated order. A couple examples: if a SequentiallyConsistent store is
	immediately followed by another SequentiallyConsistent store to the same
	address, the first store can be erased. This transformation is not allowed for a
	pair of volatile stores. On the other hand, a non-volatile non-atomic load can
	be moved across a volatile load freely, but not an Acquire load.

	This document is intended to provide a guide to anyone either writing a frontend
	for LLVM or working on optimization passes for LLVM with a guide for how to deal
	with instructions with special semantics in the presence of concurrency. This
	is not intended to be a precise guide to the semantics; the details can get
	extremely complicated and unreadable, and are not usually necessary.

	.. _Optimization outside atomic:

	Optimization outside atomic
	===========================

	The basic ``'load'`` and ``'store'`` allow a variety of optimizations, but can
	lead to undefined results in a concurrent environment; see `NotAtomic`_. This
	section specifically goes into the one optimizer restriction which applies in
	concurrent environments, which gets a bit more of an extended description
	because any optimization dealing with stores needs to be aware of it.

	From the optimizer's point of view, the rule is that if there are not any
	instructions with atomic ordering involved, concurrency does not matter, with
	one exception: if a variable might be visible to another thread or signal
	handler, a store cannot be inserted along a path where it might not execute
	otherwise. Take the following example:

	.. code-block:: c

	/* C code, for readability; run through clang -O2 -S -emit-llvm to get
	equivalent IR */
	int x;
	void f(int* a) {
	for (int i = 0; i < 100; i++) {
	if (a[i])
	x += 1;
	}
	}

	The following is equivalent in non-concurrent situations:

	.. code-block:: c

	int x;
	void f(int* a) {
	int xtemp = x;
	for (int i = 0; i < 100; i++) {
	if (a[i])
	xtemp += 1;
	}
	x = xtemp;
	}

	However, LLVM is not allowed to transform the former to the latter: it could
	indirectly introduce undefined behavior if another thread can access ``x`` at
	the same time. (This example is particularly of interest because before the
	concurrency model was implemented, LLVM would perform this transformation.)

	Note that speculative loads are allowed; a load which is part of a race returns
	``undef``, but does not have undefined behavior.

	Atomic instructions
	===================

	For cases where simple loads and stores are not sufficient, LLVM provides
	various atomic instructions. The exact guarantees provided depend on the
	ordering; see `Atomic orderings`_.

	``load atomic`` and ``store atomic`` provide the same basic functionality as
	non-atomic loads and stores, but provide additional guarantees in situations
	where threads and signals are involved.

	``cmpxchg`` and ``atomicrmw`` are essentially like an atomic load followed by an
	atomic store (where the store is conditional for ``cmpxchg``), but no other
	memory operation can happen on any thread between the load and store. Note that
	LLVM's cmpxchg does not provide quite as many options as the C++0x version.

	A ``fence`` provides Acquire and/or Release ordering which is not part of
	another operation; it is normally used along with Monotonic memory operations.
	A Monotonic load followed by an Acquire fence is roughly equivalent to an
	Acquire load.

	Frontends generating atomic instructions generally need to be aware of the
	target to some degree; atomic instructions are guaranteed to be lock-free, and
	therefore an instruction which is wider than the target natively supports can be
	impossible to generate.

	.. _Atomic orderings:

	Atomic orderings
	================

	In order to achieve a balance between performance and necessary guarantees,
	there are six levels of atomicity. They are listed in order of strength; each
	level includes all the guarantees of the previous level except for
	Acquire/Release. (See also `LangRef Ordering <LangRef.html#ordering>`_.)

	.. _NotAtomic:

	NotAtomic
	---------

	NotAtomic is the obvious, a load or store which is not atomic. (This isn't
	really a level of atomicity, but is listed here for comparison.) This is
	essentially a regular load or store. If there is a race on a given memory
	location, loads from that location return undef.

	Relevant standard
	This is intended to match shared variables in C/C++, and to be used in any
	other context where memory access is necessary, and a race is impossible. (The
	precise definition is in `LangRef Memory Model <LangRef.html#memmodel>`_.)

	Notes for frontends
	The rule is essentially that all memory accessed with basic loads and stores
	by multiple threads should be protected by a lock or other synchronization;
	otherwise, you are likely to run into undefined behavior. If your frontend is
	for a "safe" language like Java, use Unordered to load and store any shared
	variable. Note that NotAtomic volatile loads and stores are not properly
	atomic; do not try to use them as a substitute. (Per the C/C++ standards,
	volatile does provide some limited guarantees around asynchronous signals, but
	atomics are generally a better solution.)

	Notes for optimizers
	Introducing loads to shared variables along a codepath where they would not
	otherwise exist is allowed; introducing stores to shared variables is not. See
	`Optimization outside atomic`_.

	Notes for code generation
	The one interesting restriction here is that it is not allowed to write to
	bytes outside of the bytes relevant to a store. This is mostly relevant to
	unaligned stores: it is not allowed in general to convert an unaligned store
	into two aligned stores of the same width as the unaligned store. Backends are
	also expected to generate an i8 store as an i8 store, and not an instruction
	which writes to surrounding bytes. (If you are writing a backend for an
	architecture which cannot satisfy these restrictions and cares about
	concurrency, please send an email to llvmdev.)

	Unordered
	---------

	Unordered is the lowest level of atomicity. It essentially guarantees that races
	produce somewhat sane results instead of having undefined behavior. It also
	guarantees the operation to be lock-free, so it do not depend on the data being
	part of a special atomic structure or depend on a separate per-process global
	lock. Note that code generation will fail for unsupported atomic operations; if
	you need such an operation, use explicit locking.

	Relevant standard
	This is intended to match the Java memory model for shared variables.

	Notes for frontends
	This cannot be used for synchronization, but is useful for Java and other
	"safe" languages which need to guarantee that the generated code never
	exhibits undefined behavior. Note that this guarantee is cheap on common
	platforms for loads of a native width, but can be expensive or unavailable for
	wider loads, like a 64-bit store on ARM. (A frontend for Java or other "safe"
	languages would normally split a 64-bit store on ARM into two 32-bit unordered
	stores.)

	Notes for optimizers
	In terms of the optimizer, this prohibits any transformation that transforms a
	single load into multiple loads, transforms a store into multiple stores,
	narrows a store, or stores a value which would not be stored otherwise. Some
	examples of unsafe optimizations are narrowing an assignment into a bitfield,
	rematerializing a load, and turning loads and stores into a memcpy
	call. Reordering unordered operations is safe, though, and optimizers should
	take advantage of that because unordered operations are common in languages
	that need them.

	Notes for code generation
	These operations are required to be atomic in the sense that if you use
	unordered loads and unordered stores, a load cannot see a value which was
	never stored. A normal load or store instruction is usually sufficient, but
	note that an unordered load or store cannot be split into multiple
	instructions (or an instruction which does multiple memory operations, like
	``LDRD`` on ARM without LPAE, or not naturally-aligned ``LDRD`` on LPAE ARM).

	Monotonic
	---------

	Monotonic is the weakest level of atomicity that can be used in synchronization
	primitives, although it does not provide any general synchronization. It
	essentially guarantees that if you take all the operations affecting a specific
	address, a consistent ordering exists.

	Relevant standard
	This corresponds to the C++0x/C1x ``memory_order_relaxed``; see those
	standards for the exact definition.

	Notes for frontends
	If you are writing a frontend which uses this directly, use with caution. The
	guarantees in terms of synchronization are very weak, so make sure these are
	only used in a pattern which you know is correct. Generally, these would
	either be used for atomic operations which do not protect other memory (like
	an atomic counter), or along with a ``fence``.

	Notes for optimizers
	In terms of the optimizer, this can be treated as a read+write on the relevant
	memory location (and alias analysis will take advantage of that). In addition,
	it is legal to reorder non-atomic and Unordered loads around Monotonic
	loads. CSE/DSE and a few other optimizations are allowed, but Monotonic
	operations are unlikely to be used in ways which would make those
	optimizations useful.

	Notes for code generation
	Code generation is essentially the same as that for unordered for loads and
	stores. No fences are required. ``cmpxchg`` and ``atomicrmw`` are required
	to appear as a single operation.

	Acquire
	-------

	Acquire provides a barrier of the sort necessary to acquire a lock to access
	other memory with normal loads and stores.

	Relevant standard
	This corresponds to the C++0x/C1x ``memory_order_acquire``. It should also be
	used for C++0x/C1x ``memory_order_consume``.

	Notes for frontends
	If you are writing a frontend which uses this directly, use with caution.
	Acquire only provides a semantic guarantee when paired with a Release
	operation.

	Notes for optimizers
	Optimizers not aware of atomics can treat this like a nothrow call. It is
	also possible to move stores from before an Acquire load or read-modify-write
	operation to after it, and move non-Acquire loads from before an Acquire
	operation to after it.

	Notes for code generation
	Architectures with weak memory ordering (essentially everything relevant today
	except x86 and SPARC) require some sort of fence to maintain the Acquire
	semantics. The precise fences required varies widely by architecture, but for
	a simple implementation, most architectures provide a barrier which is strong
	enough for everything (``dmb`` on ARM, ``sync`` on PowerPC, etc.). Putting
	such a fence after the equivalent Monotonic operation is sufficient to
	maintain Acquire semantics for a memory operation.

	Release
	-------

	Release is similar to Acquire, but with a barrier of the sort necessary to
	release a lock.

	Relevant standard
	This corresponds to the C++0x/C1x ``memory_order_release``.

	Notes for frontends
	If you are writing a frontend which uses this directly, use with caution.
	Release only provides a semantic guarantee when paired with a Acquire
	operation.

	Notes for optimizers
	Optimizers not aware of atomics can treat this like a nothrow call. It is
	also possible to move loads from after a Release store or read-modify-write
	operation to before it, and move non-Release stores from after an Release
	operation to before it.

	Notes for code generation
	See the section on Acquire; a fence before the relevant operation is usually
	sufficient for Release. Note that a store-store fence is not sufficient to
	implement Release semantics; store-store fences are generally not exposed to
	IR because they are extremely difficult to use correctly.

	AcquireRelease
	--------------

	AcquireRelease (``acq_rel`` in IR) provides both an Acquire and a Release
	barrier (for fences and operations which both read and write memory).

	Relevant standard
	This corresponds to the C++0x/C1x ``memory_order_acq_rel``.

	Notes for frontends
	If you are writing a frontend which uses this directly, use with caution.
	Acquire only provides a semantic guarantee when paired with a Release
	operation, and vice versa.

	Notes for optimizers
	In general, optimizers should treat this like a nothrow call; the possible
	optimizations are usually not interesting.

	Notes for code generation
	This operation has Acquire and Release semantics; see the sections on Acquire
	and Release.

	SequentiallyConsistent
	----------------------

	SequentiallyConsistent (``seq_cst`` in IR) provides Acquire semantics for loads
	and Release semantics for stores. Additionally, it guarantees that a total
	ordering exists between all SequentiallyConsistent operations.

	Relevant standard
	This corresponds to the C++0x/C1x ``memory_order_seq_cst``, Java volatile, and
	the gcc-compatible ``__sync_*`` builtins which do not specify otherwise.

	Notes for frontends
	If a frontend is exposing atomic operations, these are much easier to reason
	about for the programmer than other kinds of operations, and using them is
	generally a practical performance tradeoff.

	Notes for optimizers
	Optimizers not aware of atomics can treat this like a nothrow call. For
	SequentiallyConsistent loads and stores, the same reorderings are allowed as
	for Acquire loads and Release stores, except that SequentiallyConsistent
	operations may not be reordered.

	Notes for code generation
	SequentiallyConsistent loads minimally require the same barriers as Acquire
	operations and SequentiallyConsistent stores require Release
	barriers. Additionally, the code generator must enforce ordering between
	SequentiallyConsistent stores followed by SequentiallyConsistent loads. This
	is usually done by emitting either a full fence before the loads or a full
	fence after the stores; which is preferred varies by architecture.

	Atomics and IR optimization
	===========================

	Predicates for optimizer writers to query:

	* ``isSimple()``: A load or store which is not volatile or atomic. This is
	what, for example, memcpyopt would check for operations it might transform.

	* ``isUnordered()``: A load or store which is not volatile and at most
	Unordered. This would be checked, for example, by LICM before hoisting an
	operation.

	* ``mayReadFromMemory()``/``mayWriteToMemory()``: Existing predicate, but note
	that they return true for any operation which is volatile or at least
	Monotonic.

	* Alias analysis: Note that AA will return ModRef for anything Acquire or
	Release, and for the address accessed by any Monotonic operation.

	To support optimizing around atomic operations, make sure you are using the
	right predicates; everything should work if that is done. If your pass should
	optimize some atomic operations (Unordered operations in particular), make sure
	it doesn't replace an atomic load or store with a non-atomic operation.

	Some examples of how optimizations interact with various kinds of atomic
	operations:

	* ``memcpyopt``: An atomic operation cannot be optimized into part of a
	memcpy/memset, including unordered loads/stores. It can pull operations
	across some atomic operations.

	* LICM: Unordered loads/stores can be moved out of a loop. It just treats
	monotonic operations like a read+write to a memory location, and anything
	stricter than that like a nothrow call.

	* DSE: Unordered stores can be DSE'ed like normal stores. Monotonic stores can
	be DSE'ed in some cases, but it's tricky to reason about, and not especially
	important.

	* Folding a load: Any atomic load from a constant global can be constant-folded,
	because it cannot be observed. Similar reasoning allows scalarrepl with
	atomic loads and stores.

	Atomics and Codegen
	===================

	Atomic operations are represented in the SelectionDAG with ``ATOMIC_*`` opcodes.
	On architectures which use barrier instructions for all atomic ordering (like
	ARM), appropriate fences are split out as the DAG is built.

	The MachineMemOperand for all atomic operations is currently marked as volatile;
	this is not correct in the IR sense of volatile, but CodeGen handles anything
	marked volatile very conservatively. This should get fixed at some point.

	Common architectures have some way of representing at least a pointer-sized
	lock-free ``cmpxchg``; such an operation can be used to implement all the other
	atomic operations which can be represented in IR up to that size. Backends are
	expected to implement all those operations, but not operations which cannot be
	implemented in a lock-free manner. It is expected that backends will give an
	error when given an operation which cannot be implemented. (The LLVM code
	generator is not very helpful here at the moment, but hopefully that will
	change.)

	The implementation of atomics on LL/SC architectures (like ARM) is currently a
	bit of a mess; there is a lot of copy-pasted code across targets, and the
	representation is relatively unsuited to optimization (it would be nice to be
	able to optimize loops involving cmpxchg etc.).

	On x86, all atomic loads generate a ``MOV``. SequentiallyConsistent stores
	generate an ``XCHG``, other stores generate a ``MOV``. SequentiallyConsistent
	fences generate an ``MFENCE``, other fences do not cause any code to be
	generated. cmpxchg uses the ``LOCK CMPXCHG`` instruction. ``atomicrmw xchg``
	uses ``XCHG``, ``atomicrmw add`` and ``atomicrmw sub`` use ``XADD``, and all
	other ``atomicrmw`` operations generate a loop with ``LOCK CMPXCHG``. Depending
	on the users of the result, some ``atomicrmw`` operations can be translated into
	operations like ``LOCK AND``, but that does not work in general.

	On ARM, MIPS, and many other RISC architectures, Acquire, Release, and
	SequentiallyConsistent semantics require barrier instructions for every such
	operation. Loads and stores generate normal instructions. ``cmpxchg`` and
	``atomicrmw`` can be represented using a loop with LL/SC-style instructions
	which take some sort of exclusive lock on a cache line (``LDREX`` and ``STREX``
	on ARM, etc.). At the moment, the IR does not provide any way to represent a
	weak ``cmpxchg`` which would not require a loop.