| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" |
| "http://www.w3.org/TR/html4/strict.dtd"> |
| <html> |
| <head> |
| <title>LLVM Atomic Instructions and Concurrency Guide</title> |
| <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> |
| <link rel="stylesheet" href="llvm.css" type="text/css"> |
| </head> |
| <body> |
| |
| <h1> |
| LLVM Atomic Instructions and Concurrency Guide |
| </h1> |
| |
| <ol> |
| <li><a href="#introduction">Introduction</a></li> |
| <li><a href="#loadstore">Load and store</a></li> |
| <li><a href="#otherinst">Other atomic instructions</a></li> |
| <li><a href="#ordering">Atomic orderings</a></li> |
| <li><a href="#iropt">Atomics and IR optimization</a></li> |
| <li><a href="#codegen">Atomics and Codegen</a></li> |
| </ol> |
| |
| <div class="doc_author"> |
| <p>Written by Eli Friedman</p> |
| </div> |
| |
| <!-- *********************************************************************** --> |
| <h2> |
| <a name="introduction">Introduction</a> |
| </h2> |
| <!-- *********************************************************************** --> |
| |
| <div> |
| |
| <p>Historically, LLVM has not had very strong support for concurrency; some |
| minimal intrinsics were provided, and <code>volatile</code> was used in some |
| cases to achieve rough semantics in the presence of concurrency. However, this |
| is changing; there are now new instructions which are well-defined in the |
| presence of threads and asynchronous signals, and the model for existing |
| instructions has been clarified in the IR.</p> |
| |
| <p>The atomic instructions are designed specifically to provide readable IR and |
| optimized code generation for the following:</p> |
| <ul> |
| <li>The new C++0x <code><atomic></code> header. |
| (<a href="http://www.open-std.org/jtc1/sc22/wg21/">C++0x draft available here</a>.) |
| (<a href="http://www.open-std.org/jtc1/sc22/wg14/">C1x draft available here</a>)</li> |
| <li>Proper semantics for Java-style memory, for both <code>volatile</code> and |
| regular shared variables. |
| (<a href="http://java.sun.com/docs/books/jls/third_edition/html/memory.html">Java Specification</a>)</li> |
| <li>gcc-compatible <code>__sync_*</code> builtins. |
| (<a href="http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html">Description</a>)</li> |
| <li>Other scenarios with atomic semantics, including <code>static</code> |
| variables with non-trivial constructors in C++.</li> |
| </ul> |
| |
| <p>Atomic and volatile in the IR are orthogonal; "volatile" is the C/C++ |
| volatile, which ensures that every volatile load and store happens and is |
| performed in the stated order. A couple examples: if a |
| SequentiallyConsistent store is immediately followed by another |
| SequentiallyConsistent store to the same address, the first store can |
| be erased. This transformation is not allowed for a pair of volatile |
| stores. On the other hand, a non-volatile non-atomic load can be moved |
| across a volatile load freely, but not an Acquire load.</p> |
| |
| <p>This document is intended to provide a guide to anyone either writing a |
| frontend for LLVM or working on optimization passes for LLVM with a guide |
| for how to deal with instructions with special semantics in the presence of |
| concurrency. This is not intended to be a precise guide to the semantics; |
| the details can get extremely complicated and unreadable, and are not |
| usually necessary.</p> |
| |
| </div> |
| |
| <!-- *********************************************************************** --> |
| <h2> |
| <a name="loadstore">Load and store</a> |
| </h2> |
| <!-- *********************************************************************** --> |
| |
| <div> |
| |
| <p>The basic <code>'load'</code> and <code>'store'</code> allow a variety of |
| optimizations, but can have unintuitive results in a concurrent environment. |
| For a frontend writer, the rule is essentially that all memory accessed |
| with basic loads and stores by multiple threads should be protected by a |
| lock or other synchronization; otherwise, you are likely to run into |
| undefined behavior. (Do not use volatile as a substitute for atomics; it |
| might work on some platforms, but does not provide the necessary guarantees |
| in general.)</p> |
| |
| <p>From the optimizer's point of view, the rule is that if there |
| are not any instructions with atomic ordering involved, concurrency does |
| not matter, with one exception: if a variable might be visible to another |
| thread or signal handler, a store cannot be inserted along a path where it |
| might not execute otherwise. For example, suppose LICM wants to take all the |
| loads and stores in a loop to and from a particular address and promote them |
| to registers. LICM is not allowed to insert an unconditional store after |
| the loop with the computed value unless a store unconditionally executes |
| within the loop. Note that speculative loads are allowed; a load which |
| is part of a race returns <code>undef</code>, but does not have undefined |
| behavior.</p> |
| |
| <p>For cases where simple loads and stores are not sufficient, LLVM provides |
| atomic loads and stores with varying levels of guarantees.</p> |
| |
| </div> |
| |
| <!-- *********************************************************************** --> |
| <h2> |
| <a name="otherinst">Other atomic instructions</a> |
| </h2> |
| <!-- *********************************************************************** --> |
| |
| <div> |
| |
| <p><code>cmpxchg</code> and <code>atomicrmw</code> are essentially like an |
| atomic load followed by an atomic store (where the store is conditional for |
| <code>cmpxchg</code>), but no other memory operation can happen between |
| the load and store. Note that our cmpxchg does not have quite as many |
| options for making cmpxchg weaker as the C++0x version.</p> |
| |
| <p>A <code>fence</code> provides Acquire and/or Release ordering which is not |
| part of another operation; it is normally used along with Monotonic memory |
| operations. A Monotonic load followed by an Acquire fence is roughly |
| equivalent to an Acquire load.</p> |
| |
| <p>Frontends generating atomic instructions generally need to be aware of the |
| target to some degree; atomic instructions are guaranteed to be lock-free, |
| and therefore an instruction which is wider than the target natively supports |
| can be impossible to generate.</p> |
| |
| </div> |
| |
| <!-- *********************************************************************** --> |
| <h2> |
| <a name="ordering">Atomic orderings</a> |
| </h2> |
| <!-- *********************************************************************** --> |
| |
| <div> |
| |
| <p>In order to achieve a balance between performance and necessary guarantees, |
| there are six levels of atomicity. They are listed in order of strength; |
| each level includes all the guarantees of the previous level except for |
| Acquire/Release.</p> |
| |
| <!-- ======================================================================= --> |
| <h3> |
| <a name="o_unordered">Unordered</a> |
| </h3> |
| |
| <div> |
| |
| <p>Unordered is the lowest level of atomicity. It essentially guarantees that |
| races produce somewhat sane results instead of having undefined behavior. |
| It also guarantees the operation to be lock-free, so it do not depend on |
| the data being part of a special atomic structure or depend on a separate |
| per-process global lock. Note that code generation will fail for |
| unsupported atomic operations; if you need such an operation, use explicit |
| locking.</p> |
| |
| <dl> |
| <dt>Relevant standard</dt> |
| <dd>This is intended to match the Java memory model for shared |
| variables.</dd> |
| <dt>Notes for frontends</dt> |
| <dd>This cannot be used for synchronization, but is useful for Java and |
| other "safe" languages which need to guarantee that the generated |
| code never exhibits undefined behavior. Note that this guarantee |
| is cheap on common platforms for loads of a native width, but can |
| be expensive or unavailable for wider loads, like a 64-bit store |
| on ARM. (A frontend for Java or other "safe" languages would normally |
| split a 64-bit store on ARM into two 32-bit unordered stores.) |
| <dt>Notes for optimizers</dt> |
| <dd>In terms of the optimizer, this prohibits any transformation that |
| transforms a single load into multiple loads, transforms a store |
| into multiple stores, narrows a store, or stores a value which |
| would not be stored otherwise. Some examples of unsafe optimizations |
| are narrowing an assignment into a bitfield, rematerializing |
| a load, and turning loads and stores into a memcpy call. Reordering |
| unordered operations is safe, though, and optimizers should take |
| advantage of that because unordered operations are common in |
| languages that need them.</dd> |
| <dt>Notes for code generation</dt> |
| <dd>These operations are required to be atomic in the sense that if you |
| use unordered loads and unordered stores, a load cannot see a value |
| which was never stored. A normal load or store instruction is usually |
| sufficient, but note that an unordered load or store cannot |
| be split into multiple instructions (or an instruction which |
| does multiple memory operations, like <code>LDRD</code> on ARM).</dd> |
| </dl> |
| |
| </div> |
| |
| <!-- ======================================================================= --> |
| <h3> |
| <a name="o_monotonic">Monotonic</a> |
| </h3> |
| |
| <div> |
| |
| <p>Monotonic is the weakest level of atomicity that can be used in |
| synchronization primitives, although it does not provide any general |
| synchronization. It essentially guarantees that if you take all the |
| operations affecting a specific address, a consistent ordering exists. |
| |
| <dl> |
| <dt>Relevant standard</dt> |
| <dd>This corresponds to the C++0x/C1x <code>memory_order_relaxed</code>; |
| see those standards for the exact definition. |
| <dt>Notes for frontends</dt> |
| <dd>If you are writing a frontend which uses this directly, use with caution. |
| The guarantees in terms of synchronization are very weak, so make |
| sure these are only used in a pattern which you know is correct. |
| Generally, these would either be used for atomic operations which |
| do not protect other memory (like an atomic counter), or along with |
| a <code>fence</code>.</dd> |
| <dt>Notes for optimizers</dt> |
| <dd>In terms of the optimizer, this can be treated as a read+write on the |
| relevant memory location (and alias analysis will take advantage of |
| that). In addition, it is legal to reorder non-atomic and Unordered |
| loads around Monotonic loads. CSE/DSE and a few other optimizations |
| are allowed, but Monotonic operations are unlikely to be used in ways |
| which would make those optimizations useful.</dd> |
| <dt>Notes for code generation</dt> |
| <dd>Code generation is essentially the same as that for unordered for loads |
| and stores. No fences is required. <code>cmpxchg</code> and |
| <code>atomicrmw</code> are required to appear as a single operation.</dd> |
| </dl> |
| |
| </div> |
| |
| <!-- ======================================================================= --> |
| <h3> |
| <a name="o_acquire">Acquire</a> |
| </h3> |
| |
| <div> |
| |
| <p>Acquire provides a barrier of the sort necessary to acquire a lock to access |
| other memory with normal loads and stores. |
| |
| <dl> |
| <dt>Relevant standard</dt> |
| <dd>This corresponds to the C++0x/C1x <code>memory_order_acquire</code>. It |
| should also be used for C++0x/C1x <code>memory_order_consume</code>. |
| <dt>Notes for frontends</dt> |
| <dd>If you are writing a frontend which uses this directly, use with caution. |
| Acquire only provides a semantic guarantee when paired with a Release |
| operation.</dd> |
| <dt>Notes for optimizers</dt> |
| <dd>In general, optimizers should treat this like a nothrow call; the |
| the possible optimizations are usually not interesting.</dd> |
| <dt>Notes for code generation</dt> |
| <dd>Architectures with weak memory ordering (essentially everything relevant |
| today except x86 and SPARC) require some sort of fence to maintain |
| the Acquire semantics. The precise fences required varies widely by |
| architecture, but for a simple implementation, most architectures provide |
| a barrier which is strong enough for everything (<code>dmb</code> on ARM, |
| <code>sync</code> on PowerPC, etc.). Putting such a fence after the |
| equivalent Monotonic operation is sufficient to maintain Acquire |
| semantics for a memory operation.</dd> |
| </dl> |
| |
| </div> |
| |
| <!-- ======================================================================= --> |
| <h3> |
| <a name="o_acquire">Release</a> |
| </h3> |
| |
| <div> |
| |
| <p>Release is similar to Acquire, but with a barrier of the sort necessary to |
| release a lock. |
| |
| <dl> |
| <dt>Relevant standard</dt> |
| <dd>This corresponds to the C++0x/C1x <code>memory_order_release</code>.</dd> |
| <dt>Notes for frontends</dt> |
| <dd>If you are writing a frontend which uses this directly, use with caution. |
| Release only provides a semantic guarantee when paired with a Acquire |
| operation.</dd> |
| <dt>Notes for optimizers</dt> |
| <dd>In general, optimizers should treat this like a nothrow call; the |
| the possible optimizations are usually not interesting.</dd> |
| <dt>Notes for code generation</dt> |
| <dd>Similarly to Acquire, a fence after the relevant operation is usually |
| sufficient; see the section on Acquire. Note that a store-store fence |
| is not sufficient to implement Release semantics; store-store fences |
| are generally not exposed to IR because they are extremely difficult to |
| use correctly.</dd> |
| </dl> |
| |
| </div> |
| |
| <!-- ======================================================================= --> |
| <h3> |
| <a name="o_acqrel">AcquireRelease</a> |
| </h3> |
| |
| <div> |
| |
| <p>AcquireRelease (<code>acq_rel</code> in IR) provides both an Acquire and a |
| Release barrier (for fences and operations which both read and write memory). |
| |
| <dl> |
| <dt>Relevant standard</dt> |
| <dd>This corresponds to the C++0x/C1x <code>memory_order_acq_rel</code>. |
| <dt>Notes for frontends</dt> |
| <dd>If you are writing a frontend which uses this directly, use with caution. |
| Acquire only provides a semantic guarantee when paired with a Release |
| operation, and vice versa.</dd> |
| <dt>Notes for optimizers</dt> |
| <dd>In general, optimizers should treat this like a nothrow call; the |
| the possible optimizations are usually not interesting.</dd> |
| <dt>Notes for code generation</dt> |
| <dd>This operation has Acquire and Release semantics; see the sections on |
| Acquire and Release.</dd> |
| </dl> |
| |
| </div> |
| |
| <!-- ======================================================================= --> |
| <h3> |
| <a name="o_seqcst">SequentiallyConsistent</a> |
| </h3> |
| |
| <div> |
| |
| <p>SequentiallyConsistent (<code>seq_cst</code> in IR) provides Acquire and/or |
| Release semantics, and in addition guarantees a total ordering exists with |
| all other SequentiallyConsistent operations. |
| |
| <dl> |
| <dt>Relevant standard</dt> |
| <dd>This corresponds to the C++0x/C1x <code>memory_order_seq_cst</code>, |
| Java volatile, and the gcc-compatible <code>__sync_*</code> builtins |
| which do not specify otherwise. |
| <dt>Notes for frontends</dt> |
| <dd>If a frontend is exposing atomic operations, these are much easier to |
| reason about for the programmer than other kinds of operations, and using |
| them is generally a practical performance tradeoff.</dd> |
| <dt>Notes for optimizers</dt> |
| <dd>In general, optimizers should treat this like a nothrow call; the |
| the possible optimizations are usually not interesting.</dd> |
| <dt>Notes for code generation</dt> |
| <dd>SequentiallyConsistent operations generally require the strongest |
| barriers supported by the architecture.</dd> |
| </dl> |
| |
| </div> |
| |
| </div> |
| |
| <!-- *********************************************************************** --> |
| <h2> |
| <a name="iropt">Atomics and IR optimization</a> |
| </h2> |
| <!-- *********************************************************************** --> |
| |
| <div> |
| |
| <p>Predicates for optimizer writers to query: |
| <ul> |
| <li>isSimple(): A load or store which is not volatile or atomic. This is |
| what, for example, memcpyopt would check for operations it might |
| transform. |
| <li>isUnordered(): A load or store which is not volatile and at most |
| Unordered. This would be checked, for example, by LICM before hoisting |
| an operation. |
| <li>mayReadFromMemory()/mayWriteToMemory(): Existing predicate, but note |
| that they return true for any operation which is volatile or at least |
| Monotonic. |
| <li>Alias analysis: Note that AA will return ModRef for anything Acquire or |
| Release, and for the address accessed by any Monotonic operation. |
| </ul> |
| |
| <p>There are essentially two components to supporting atomic operations. The |
| first is making sure to query isSimple() or isUnordered() instead |
| of isVolatile() before transforming an operation. The other piece is |
| making sure that a transform does not end up replacing, for example, an |
| Unordered operation with a non-atomic operation. Most of the other |
| necessary checks automatically fall out from existing predicates and |
| alias analysis queries.</p> |
| |
| <p>Some examples of how optimizations interact with various kinds of atomic |
| operations: |
| <ul> |
| <li>memcpyopt: An atomic operation cannot be optimized into part of a |
| memcpy/memset, including unordered loads/stores. It can pull operations |
| across some atomic operations. |
| <li>LICM: Unordered loads/stores can be moved out of a loop. It just treats |
| monotonic operations like a read+write to a memory location, and anything |
| stricter than that like a nothrow call. |
| <li>DSE: Unordered stores can be DSE'ed like normal stores. Monotonic stores |
| can be DSE'ed in some cases, but it's tricky to reason about, and not |
| especially important. |
| <li>Folding a load: Any atomic load from a constant global can be |
| constant-folded, because it cannot be observed. Similar reasoning allows |
| scalarrepl with atomic loads and stores. |
| </ul> |
| |
| </div> |
| |
| <!-- *********************************************************************** --> |
| <h2> |
| <a name="codegen">Atomics and Codegen</a> |
| </h2> |
| <!-- *********************************************************************** --> |
| |
| <div> |
| |
| <p>Atomic operations are represented in the SelectionDAG with |
| <code>ATOMIC_*</code> opcodes. On architectures which use barrier |
| instructions for all atomic ordering (like ARM), appropriate fences are |
| split out as the DAG is built.</p> |
| |
| <p>The MachineMemOperand for all atomic operations is currently marked as |
| volatile; this is not correct in the IR sense of volatile, but CodeGen |
| handles anything marked volatile very conservatively. This should get |
| fixed at some point.</p> |
| |
| <p>Common architectures have some way of representing at least a pointer-sized |
| lock-free <code>cmpxchg</code>; such an operation can be used to implement |
| all the other atomic operations which can be represented in IR up to that |
| size. Backends are expected to implement all those operations, but not |
| operations which cannot be implemented in a lock-free manner. It is |
| expected that backends will give an error when given an operation which |
| cannot be implemented. (The LLVM code generator is not very helpful here |
| at the moment, but hopefully that will change.)</p> |
| |
| <p>The implementation of atomics on LL/SC architectures (like ARM) is currently |
| a bit of a mess; there is a lot of copy-pasted code across targets, and |
| the representation is relatively unsuited to optimization (it would be nice |
| to be able to optimize loops involving cmpxchg etc.).</p> |
| |
| <p>On x86, all atomic loads generate a <code>MOV</code>. |
| SequentiallyConsistent stores generate an <code>XCHG</code>, other stores |
| generate a <code>MOV</code>. SequentiallyConsistent fences generate an |
| <code>MFENCE</code>, other fences do not cause any code to be generated. |
| cmpxchg uses the <code>LOCK CMPXCHG</code> instruction. |
| <code>atomicrmw xchg</code> uses <code>XCHG</code>, |
| <code>atomicrmw add</code> and <code>atomicrmw sub</code> use |
| <code>XADD</code>, and all other <code>atomicrmw</code> operations generate |
| a loop with <code>LOCK CMPXCHG</code>. Depending on the users of the |
| result, some <code>atomicrmw</code> operations can be translated into |
| operations like <code>LOCK AND</code>, but that does not work in |
| general.</p> |
| |
| <p>On ARM, MIPS, and many other RISC architectures, Acquire, Release, and |
| SequentiallyConsistent semantics require barrier instructions |
| for every such operation. Loads and stores generate normal instructions. |
| <code>cmpxchg</code> and <code>atomicrmw</code> can be represented using |
| a loop with LL/SC-style instructions which take some sort of exclusive |
| lock on a cache line (<code>LDREX</code> and <code>STREX</code> on |
| ARM, etc.). At the moment, the IR does not provide any way to represent a |
| weak <code>cmpxchg</code> which would not require a loop.</p> |
| </div> |
| |
| <!-- *********************************************************************** --> |
| |
| <hr> |
| <address> |
| <a href="http://jigsaw.w3.org/css-validator/check/referer"><img |
| src="http://jigsaw.w3.org/css-validator/images/vcss-blue" alt="Valid CSS"></a> |
| <a href="http://validator.w3.org/check/referer"><img |
| src="http://www.w3.org/Icons/valid-html401-blue" alt="Valid HTML 4.01"></a> |
| |
| <a href="http://llvm.org/">LLVM Compiler Infrastructure</a><br> |
| Last modified: $Date: 2011-08-09 02:07:00 -0700 (Tue, 09 Aug 2011) $ |
| </address> |
| |
| </body> |
| </html> |