.. _atomics:

==============================================
LLVM Atomic Instructions and Concurrency Guide
==============================================

.. contents::
   :local:

Introduction
============

Historically, LLVM has not had very strong support for concurrency; some
minimal intrinsics were provided, and ``volatile`` was used in some cases to
achieve rough semantics in the presence of concurrency. However, this is
changing; there are now new instructions which are well-defined in the presence
of threads and asynchronous signals, and the model for existing instructions
has been clarified in the IR.

The atomic instructions are designed specifically to provide readable IR and
optimized code generation for the following:

* The new C++0x ``<atomic>`` header. (`C++0x draft available here
  <http://www.open-std.org/jtc1/sc22/wg21/>`_.) (`C1x draft available here
  <http://www.open-std.org/jtc1/sc22/wg14/>`_.)

* Proper semantics for Java-style memory, for both ``volatile`` and regular
  shared variables. (`Java Specification
  <http://java.sun.com/docs/books/jls/third_edition/html/memory.html>`_)

* gcc-compatible ``__sync_*`` builtins. (`Description
  <http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html>`_)

* Other scenarios with atomic semantics, including ``static`` variables with
  non-trivial constructors in C++.
Atomic and volatile in the IR are orthogonal; "volatile" is the C/C++ volatile,
which ensures that every volatile load and store happens and is performed in
the stated order. A couple of examples: if a SequentiallyConsistent store is
immediately followed by another SequentiallyConsistent store to the same
address, the first store can be erased. This transformation is not allowed for
a pair of volatile stores. On the other hand, a non-volatile non-atomic load
can be moved across a volatile load freely, but not an Acquire load.
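
As a rough source-level illustration (a C++0x sketch, not the IR itself, and
assuming the usual clang lowering), the distinction looks like this:

.. code-block:: c++

  #include <atomic>

  std::atomic<int> a;  // becomes atomic loads/stores in the IR
  volatile int v;      // becomes volatile (but non-atomic) operations

  void g() {
    // Two adjacent SequentiallyConsistent stores: the first is dead and may
    // be erased by the optimizer.
    a.store(1);
    a.store(2);

    // Two adjacent volatile stores: both must be emitted, in this order.
    v = 1;
    v = 2;
  }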

This document is intended to provide a guide for anyone either writing a
frontend for LLVM or working on optimization passes for LLVM, explaining how to
deal with instructions with special semantics in the presence of concurrency.
This is not intended to be a precise guide to the semantics; the details can
get extremely complicated and unreadable, and are not usually necessary.

.. _Optimization outside atomic:

Optimization outside atomic
===========================

The basic ``'load'`` and ``'store'`` allow a variety of optimizations, but can
lead to undefined results in a concurrent environment; see `NotAtomic`_. This
section specifically goes into the one optimizer restriction which applies in
concurrent environments; it gets an extended description here because any
optimization dealing with stores needs to be aware of it.

From the optimizer's point of view, the rule is that if there are not any
instructions with atomic ordering involved, concurrency does not matter, with
one exception: if a variable might be visible to another thread or signal
handler, a store cannot be inserted along a path where it might not execute
otherwise. Take the following example:

.. code-block:: c

  /* C code, for readability; run through clang -O2 -S -emit-llvm to get
     equivalent IR */
  int x;
  void f(int* a) {
    for (int i = 0; i < 100; i++) {
      if (a[i])
        x += 1;
    }
  }

The following is equivalent in non-concurrent situations:

.. code-block:: c

  int x;
  void f(int* a) {
    int xtemp = x;
    for (int i = 0; i < 100; i++) {
      if (a[i])
        xtemp += 1;
    }
    x = xtemp;
  }

However, LLVM is not allowed to transform the former to the latter: it could
indirectly introduce undefined behavior if another thread can access ``x`` at
the same time. (This example is particularly of interest because before the
concurrency model was implemented, LLVM would perform this transformation.)

Note that speculative loads are allowed; a load which is part of a race returns
``undef``, but does not have undefined behavior.
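
For instance (a sketch at the source level, with illustrative names), the load
below may be hoisted above the branch, even though the corresponding store in
the earlier example may not be:

.. code-block:: c++

  int x;

  int g(int *a, int i) {
    int r = 0;
    if (a[i])
      r = x;  // This load may be hoisted above the branch (a speculative
              // load). If another thread writes x concurrently, the loaded
              // value is undef, but executing the load is not undefined
              // behavior.
    return r;
  }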

Atomic instructions
===================

For cases where simple loads and stores are not sufficient, LLVM provides
various atomic instructions. The exact guarantees provided depend on the
ordering; see `Atomic orderings`_.

``load atomic`` and ``store atomic`` provide the same basic functionality as
non-atomic loads and stores, but provide additional guarantees in situations
where threads and signals are involved.

``cmpxchg`` and ``atomicrmw`` are essentially like an atomic load followed by an
atomic store (where the store is conditional for ``cmpxchg``), but no other
memory operation can happen on any thread between the load and store. Note that
LLVM's cmpxchg does not provide quite as many options as the C++0x version.
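
At the source level, these correspond roughly to the C++0x compare-exchange and
fetch-and-op operations; a sketch of the mapping (the function names here are
illustrative, and the exact lowering is up to the frontend):

.. code-block:: c++

  #include <atomic>

  std::atomic<int> val;

  bool try_claim(int expected, int desired) {
    // Roughly a 'cmpxchg' at the IR level.
    return val.compare_exchange_strong(expected, desired);
  }

  int bump() {
    // Roughly an 'atomicrmw add' at the IR level; returns the old value.
    return val.fetch_add(1);
  }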

A ``fence`` provides Acquire and/or Release ordering which is not part of
another operation; it is normally used along with Monotonic memory operations.
A Monotonic load followed by an Acquire fence is roughly equivalent to an
Acquire load.
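
In C++0x terms, that combination looks like the following sketch (assuming the
usual mapping of ``memory_order_relaxed`` to Monotonic):

.. code-block:: c++

  #include <atomic>

  std::atomic<int> flag;

  int load_with_fence() {
    // A Monotonic (relaxed) load followed by an Acquire fence...
    int v = flag.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    // ...is roughly equivalent to a single Acquire load:
    //   int v = flag.load(std::memory_order_acquire);
    return v;
  }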

Frontends generating atomic instructions generally need to be aware of the
target to some degree; atomic instructions are guaranteed to be lock-free, and
therefore an instruction which is wider than the target natively supports can
be impossible to generate.

.. _Atomic orderings:

Atomic orderings
================

In order to achieve a balance between performance and necessary guarantees,
there are six levels of atomicity. They are listed in order of strength; each
level includes all the guarantees of the previous level except for
Acquire/Release. (See also `LangRef Ordering <LangRef.html#ordering>`_.)

.. _NotAtomic:

NotAtomic
---------

NotAtomic is the obvious: a load or store which is not atomic. (This isn't
really a level of atomicity, but is listed here for comparison.) This is
essentially a regular load or store. If there is a race on a given memory
location, loads from that location return undef.

Relevant standard
  This is intended to match shared variables in C/C++, and to be used in any
  other context where memory access is necessary, and a race is impossible.
  (The precise definition is in `LangRef Memory Model <LangRef.html#memmodel>`_.)

Notes for frontends
  The rule is essentially that all memory accessed with basic loads and stores
  by multiple threads should be protected by a lock or other synchronization;
  otherwise, you are likely to run into undefined behavior. If your frontend is
  for a "safe" language like Java, use Unordered to load and store any shared
  variable. Note that NotAtomic volatile loads and stores are not properly
  atomic; do not try to use them as a substitute. (Per the C/C++ standards,
  volatile does provide some limited guarantees around asynchronous signals,
  but atomics are generally a better solution.)

Notes for optimizers
  Introducing loads to shared variables along a codepath where they would not
  otherwise exist is allowed; introducing stores to shared variables is not.
  See `Optimization outside atomic`_.

Notes for code generation
  The one interesting restriction here is that it is not allowed to write to
  bytes outside of the bytes relevant to a store. This is mostly relevant to
  unaligned stores: it is not allowed in general to convert an unaligned store
  into two aligned stores of the same width as the unaligned store. Backends
  are also expected to generate an i8 store as an i8 store, and not an
  instruction which writes to surrounding bytes. (If you are writing a backend
  for an architecture which cannot satisfy these restrictions and cares about
  concurrency, please send an email to llvmdev.)
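
For example (a sketch of the kind of code this restriction protects), two
threads may legitimately write to adjacent bytes, so an ``i8`` store must not
be implemented as a wider read-modify-write of the surrounding word:

.. code-block:: c++

  struct Flags {
    char a;
    char b;
  };

  Flags flags;

  // Each store must only touch its own byte; widening either store into a
  // word-sized load/modify/store could silently lose the other thread's write.
  void thread1() { flags.a = 1; }
  void thread2() { flags.b = 1; }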

Unordered
---------

Unordered is the lowest level of atomicity. It essentially guarantees that races
produce somewhat sane results instead of having undefined behavior. It also
guarantees the operation to be lock-free, so it does not depend on the data
being part of a special atomic structure or on a separate per-process global
lock. Note that code generation will fail for unsupported atomic operations; if
you need such an operation, use explicit locking.

Relevant standard
  This is intended to match the Java memory model for shared variables.

Notes for frontends
  This cannot be used for synchronization, but is useful for Java and other
  "safe" languages which need to guarantee that the generated code never
  exhibits undefined behavior. Note that this guarantee is cheap on common
  platforms for loads of a native width, but can be expensive or unavailable
  for wider operations, like a 64-bit store on ARM. (A frontend for Java or
  other "safe" languages would normally split a 64-bit store on ARM into two
  32-bit unordered stores.)

Notes for optimizers
  In terms of the optimizer, this prohibits any transformation that transforms
  a single load into multiple loads, transforms a store into multiple stores,
  narrows a store, or stores a value which would not be stored otherwise. Some
  examples of unsafe optimizations are narrowing an assignment into a bitfield,
  rematerializing a load, and turning loads and stores into a memcpy
  call. Reordering unordered operations is safe, though, and optimizers should
  take advantage of that because unordered operations are common in languages
  that need them.

Notes for code generation
  These operations are required to be atomic in the sense that if you use
  unordered loads and unordered stores, a load cannot see a value which was
  never stored. A normal load or store instruction is usually sufficient, but
  note that an unordered load or store cannot be split into multiple
  instructions (or an instruction which does multiple memory operations, like
  ``LDRD`` on ARM).

Monotonic
---------

Monotonic is the weakest level of atomicity that can be used in synchronization
primitives, although it does not provide any general synchronization. It
essentially guarantees that if you take all the operations affecting a specific
address, a consistent ordering exists.
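
A typical use, mentioned in the notes below, is a simple counter that does not
guard any other memory; a C++0x sketch, using the usual mapping of
``memory_order_relaxed`` to Monotonic:

.. code-block:: c++

  #include <atomic>

  std::atomic<long> events;

  void record_event() {
    // Monotonic/relaxed is enough: each increment is atomic, and all
    // operations on this one location are consistently ordered.
    events.fetch_add(1, std::memory_order_relaxed);
  }

  long read_count() {
    // The value read says nothing about other memory; it is just a count.
    return events.load(std::memory_order_relaxed);
  }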

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_relaxed``; see those
  standards for the exact definition.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution. The
  guarantees in terms of synchronization are very weak, so make sure these are
  only used in a pattern which you know is correct. Generally, these would
  either be used for atomic operations which do not protect other memory (like
  an atomic counter), or along with a ``fence``.

Notes for optimizers
  In terms of the optimizer, this can be treated as a read+write on the
  relevant memory location (and alias analysis will take advantage of that). In
  addition, it is legal to reorder non-atomic and Unordered loads around
  Monotonic loads. CSE/DSE and a few other optimizations are allowed, but
  Monotonic operations are unlikely to be used in ways which would make those
  optimizations useful.

Notes for code generation
  Code generation is essentially the same as that for unordered for loads and
  stores. No fences are required. ``cmpxchg`` and ``atomicrmw`` are required
  to appear as a single operation.

Acquire
-------

Acquire provides a barrier of the sort necessary to acquire a lock to access
other memory with normal loads and stores.
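
The canonical pattern is publishing data with a Release store and consuming it
with an Acquire load; a C++0x sketch (the flag and payload names are
illustrative):

.. code-block:: c++

  #include <atomic>

  int payload;                    // ordinary (NotAtomic) data
  std::atomic<bool> ready(false);

  void producer() {
    payload = 42;                                  // plain store
    ready.store(true, std::memory_order_release);  // Release: publishes payload
  }

  void consumer() {
    if (ready.load(std::memory_order_acquire)) {   // Acquire: pairs with the
      int v = payload;                             // Release store, so this
      (void)v;                                     // plain load sees 42
    }
  }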

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_acquire``. It should also be
  used for C++0x/C1x ``memory_order_consume``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Acquire only provides a semantic guarantee when paired with a Release
  operation.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. It is
  also possible to move stores from before an Acquire load or read-modify-write
  operation to after it, and move non-Acquire loads from before an Acquire
  operation to after it.

Notes for code generation
  Architectures with weak memory ordering (essentially everything relevant
  today except x86 and SPARC) require some sort of fence to maintain the
  Acquire semantics. The precise fences required vary widely by architecture,
  but for a simple implementation, most architectures provide a barrier which
  is strong enough for everything (``dmb`` on ARM, ``sync`` on PowerPC, etc.).
  Putting such a fence after the equivalent Monotonic operation is sufficient
  to maintain Acquire semantics for a memory operation.

Release
-------

Release is similar to Acquire, but with a barrier of the sort necessary to
release a lock.

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_release``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Release only provides a semantic guarantee when paired with an Acquire
  operation.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. It is
  also possible to move loads from after a Release store or read-modify-write
  operation to before it, and move non-Release stores from after a Release
  operation to before it.

Notes for code generation
  See the section on Acquire; a fence before the relevant operation is usually
  sufficient for Release. Note that a store-store fence is not sufficient to
  implement Release semantics; store-store fences are generally not exposed to
  IR because they are extremely difficult to use correctly.

AcquireRelease
--------------

AcquireRelease (``acq_rel`` in IR) provides both an Acquire and a Release
barrier (for fences and operations which both read and write memory).

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_acq_rel``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Acquire only provides a semantic guarantee when paired with a Release
  operation, and vice versa.

Notes for optimizers
  In general, optimizers should treat this like a nothrow call; the possible
  optimizations are usually not interesting.

Notes for code generation
  This operation has Acquire and Release semantics; see the sections on Acquire
  and Release.

SequentiallyConsistent
----------------------

SequentiallyConsistent (``seq_cst`` in IR) provides Acquire semantics for loads
and Release semantics for stores. Additionally, it guarantees that a total
ordering exists between all SequentiallyConsistent operations.
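
The total ordering matters for patterns like the following (a sketch of the
classic store/load case, which Acquire/Release alone does not make work):

.. code-block:: c++

  #include <atomic>

  std::atomic<int> x(0), y(0);
  int r1, r2;

  void thread1() {
    x.store(1);     // SequentiallyConsistent store
    r1 = y.load();  // SequentiallyConsistent load
  }

  void thread2() {
    y.store(1);     // SequentiallyConsistent store
    r2 = x.load();  // SequentiallyConsistent load
  }

  // With SequentiallyConsistent operations, r1 == 0 && r2 == 0 is impossible
  // after both threads finish; with only Acquire/Release it would be allowed.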

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_seq_cst``, Java volatile,
  and the gcc-compatible ``__sync_*`` builtins which do not specify otherwise.

Notes for frontends
  If a frontend is exposing atomic operations, these are much easier to reason
  about for the programmer than other kinds of operations, and using them is
  generally a practical performance tradeoff.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. For
  SequentiallyConsistent loads and stores, the same reorderings are allowed as
  for Acquire loads and Release stores, except that SequentiallyConsistent
  operations may not be reordered.

Notes for code generation
  SequentiallyConsistent loads minimally require the same barriers as Acquire
  operations and SequentiallyConsistent stores require Release
  barriers. Additionally, the code generator must enforce ordering between
  SequentiallyConsistent stores followed by SequentiallyConsistent loads. This
  is usually done by emitting either a full fence before the loads or a full
  fence after the stores; which is preferred varies by architecture.

Atomics and IR optimization
===========================

Predicates for optimizer writers to query:

* ``isSimple()``: A load or store which is not volatile or atomic. This is
  what, for example, memcpyopt would check for operations it might transform.

* ``isUnordered()``: A load or store which is not volatile and at most
  Unordered. This would be checked, for example, by LICM before hoisting an
  operation.

* ``mayReadFromMemory()``/``mayWriteToMemory()``: Existing predicates, but note
  that they return true for any operation which is volatile or at least
  Monotonic.

* Alias analysis: Note that AA will return ModRef for anything Acquire or
  Release, and for the address accessed by any Monotonic operation.

To support optimizing around atomic operations, make sure you are using the
right predicates; everything should work if that is done. If your pass should
optimize some atomic operations (Unordered operations in particular), make sure
it doesn't replace an atomic load or store with a non-atomic operation.
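
A pass might use these predicates roughly as follows (a sketch, not taken from
any particular in-tree pass; the header path varies across LLVM versions):

.. code-block:: c++

  #include "llvm/IR/Instructions.h"  // "llvm/Instructions.h" in older trees

  using namespace llvm;

  // Decide whether a load is safe for a transformation that ignores ordering.
  static bool canTransformLoad(LoadInst *LI, bool HandlesUnordered) {
    if (LI->isSimple())
      return true;   // not volatile, not atomic
    if (HandlesUnordered && LI->isUnordered())
      return true;   // not volatile, at most Unordered
    return false;    // volatile, or Monotonic and stronger: leave it alone
  }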

Some examples of how optimizations interact with various kinds of atomic
operations:

* ``memcpyopt``: An atomic operation cannot be optimized into part of a
  memcpy/memset, including unordered loads/stores. It can pull operations
  across some atomic operations.

* LICM: Unordered loads/stores can be moved out of a loop. It just treats
  Monotonic operations like a read+write to a memory location, and anything
  stricter than that like a nothrow call.

* DSE: Unordered stores can be DSE'ed like normal stores. Monotonic stores can
  be DSE'ed in some cases, but it's tricky to reason about, and not especially
  important.

* Folding a load: Any atomic load from a constant global can be
  constant-folded, because it cannot be observed. Similar reasoning allows
  scalarrepl with atomic loads and stores.

Atomics and Codegen
===================

Atomic operations are represented in the SelectionDAG with ``ATOMIC_*`` opcodes.
On architectures which use barrier instructions for all atomic ordering (like
ARM), appropriate fences are split out as the DAG is built.

The MachineMemOperand for all atomic operations is currently marked as volatile;
this is not correct in the IR sense of volatile, but CodeGen handles anything
marked volatile very conservatively. This should get fixed at some point.

Common architectures have some way of representing at least a pointer-sized
lock-free ``cmpxchg``; such an operation can be used to implement all the other
atomic operations which can be represented in IR up to that size. Backends are
expected to implement all those operations, but not operations which cannot be
implemented in a lock-free manner. It is expected that backends will give an
error when given an operation which cannot be implemented. (The LLVM code
generator is not very helpful here at the moment, but hopefully that will
change.)

The implementation of atomics on LL/SC architectures (like ARM) is currently a
bit of a mess; there is a lot of copy-pasted code across targets, and the
representation is relatively unsuited to optimization (it would be nice to be
able to optimize loops involving cmpxchg etc.).

On x86, all atomic loads generate a ``MOV``. SequentiallyConsistent stores
generate an ``XCHG``, other stores generate a ``MOV``. SequentiallyConsistent
fences generate an ``MFENCE``, other fences do not cause any code to be
generated. cmpxchg uses the ``LOCK CMPXCHG`` instruction. ``atomicrmw xchg``
uses ``XCHG``, ``atomicrmw add`` and ``atomicrmw sub`` use ``XADD``, and all
other ``atomicrmw`` operations generate a loop with ``LOCK CMPXCHG``. Depending
on the users of the result, some ``atomicrmw`` operations can be translated into
operations like ``LOCK AND``, but that does not work in general.
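
The compare-and-swap loop mentioned above has roughly the following shape,
shown here as a C++0x sketch rather than actual emitted code
(``compare_exchange_weak`` plays the role of ``LOCK CMPXCHG``):

.. code-block:: c++

  #include <atomic>

  // Roughly what an 'atomicrmw nand' (or any operation without a dedicated
  // LOCK instruction) expands to: load, compute, retry until the compare and
  // exchange succeeds, and return the old value.
  int atomic_nand(std::atomic<int> &v, int operand) {
    int old_val = v.load();
    while (!v.compare_exchange_weak(old_val, ~(old_val & operand))) {
      // On failure, old_val has been reloaded with the current value; retry.
    }
    return old_val;
  }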

On ARM, MIPS, and many other RISC architectures, Acquire, Release, and
SequentiallyConsistent semantics require barrier instructions for every such
operation. Loads and stores generate normal instructions. ``cmpxchg`` and
``atomicrmw`` can be represented using a loop with LL/SC-style instructions
which take some sort of exclusive lock on a cache line (``LDREX`` and ``STREX``
on ARM, etc.). At the moment, the IR does not provide any way to represent a
weak ``cmpxchg`` which would not require a loop.