Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame^] | 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" |
| 2 | "http://www.w3.org/TR/html4/strict.dtd"> |
| 3 | <html> |
| 4 | <head> |
| 5 | <title>LLVM Atomic Instructions and Concurrency Guide</title> |
| 6 | <link rel="stylesheet" href="llvm.css" type="text/css"> |
| 7 | </head> |
| 8 | <body> |
| 9 | |
| 10 | <h1> |
| 11 | LLVM Atomic Instructions and Concurrency Guide |
| 12 | </h1> |
| 13 | |
| 14 | <ol> |
| 15 | <li><a href="#introduction">Introduction</a></li> |
| 16 | <li><a href="#loadstore">Load and store</a></li> |
| 17 | <li><a href="#ordering">Atomic orderings</a></li> |
| 18 | <li><a href="#otherinst">Other atomic instructions</a></li> |
| 19 | <li><a href="#iropt">Atomics and IR optimization</a></li> |
| 20 | <li><a href="#codegen">Atomics and Codegen</a></li> |
| 21 | </ol> |
| 22 | |
| 23 | <div class="doc_author"> |
| 24 | <p>Written by Eli Friedman</p> |
| 25 | </div> |
| 26 | |
| 27 | <!-- *********************************************************************** --> |
| 28 | <h2> |
| 29 | <a name="introduction">Introduction</a> |
| 30 | </h2> |
| 31 | <!-- *********************************************************************** --> |
| 32 | |
| 33 | <div> |
| 34 | |
| 35 | <p>Historically, LLVM has not had very strong support for concurrency; some |
| 36 | minimal intrinsics were provided, and <code>volatile</code> was used in some |
| 37 | cases to achieve rough semantics in the presence of concurrency. However, this |
| 38 | is changing; there are now new instructions which are well-defined in the |
| 39 | presence of threads and asynchronous signals, and the model for existing |
| 40 | instructions has been clarified in the IR.</p> |
| 41 | |
| 42 | <p>The atomic instructions are designed specifically to provide readable IR and |
| 43 | optimized code generation for the following:</p> |
| 44 | <ul> |
| 45 | <li>The new C++0x <code><atomic></code> header.</li> |
| 46 | <li>Proper semantics for Java-style memory, for both <code>volatile</code> and |
| 47 | regular shared variables.</li> |
| 48 | <li>gcc-compatible <code>__sync_*</code> builtins.</li> |
| 49 | <li>Other scenarios with atomic semantics, including <code>static</code> |
| 50 | variables with non-trivial constructors in C++.</li> |
| 51 | </ul> |
| 52 | |
| 53 | <p>This document is intended to provide a guide to anyone either writing a |
| 54 | frontend for LLVM or working on optimization passes for LLVM with a guide |
| 55 | for how to deal with instructions with special semantics in the presence of |
| 56 | concurrency. This is not intended to be a precise guide to the semantics; |
| 57 | the details can get extremely complicated and unreadable, and are not |
| 58 | usually necessary.</p> |
| 59 | |
| 60 | </div> |
| 61 | |
| 62 | <!-- *********************************************************************** --> |
| 63 | <h2> |
| 64 | <a name="loadstore">Load and store</a> |
| 65 | </h2> |
| 66 | <!-- *********************************************************************** --> |
| 67 | |
| 68 | <div> |
| 69 | |
| 70 | <p>The basic <code>'load'</code> and <code>'store'</code> allow a variety of |
| 71 | optimizations, but can have unintuitive results in a concurrent environment. |
| 72 | For a frontend writer, the rule is essentially that all memory accessed |
| 73 | with basic loads and stores by multiple threads should be protected by a |
| 74 | lock or other synchronization; otherwise, you are likely to run into |
| 75 | undefined behavior. (Do not use volatile as a substitute for atomics; it |
| 76 | might work on some platforms, but does not provide the necessary guarantees |
| 77 | in general.)</p> |
| 78 | |
| 79 | <p>From the optimizer's point of view, the rule is that if there |
| 80 | are not any instructions with atomic ordering involved, concurrency does not |
| 81 | matter, with one exception: if a variable might be visible to another |
| 82 | thread or signal handler, a store cannot be inserted along a path where it |
| 83 | might not execute otherwise. Note that speculative loads are allowed; |
| 84 | a load which is part of a race returns <code>undef</code>, but is not |
| 85 | undefined behavior.</p> |
| 86 | |
| 87 | <p>For cases where simple loads and stores are not sufficient, LLVM provides |
| 88 | atomic loads and stores with varying levels of guarantees.</p> |
| 89 | |
| 90 | </div> |
| 91 | |
| 92 | <!-- *********************************************************************** --> |
| 93 | <h2> |
| 94 | <a name="ordering">Atomic orderings</a> |
| 95 | </h2> |
| 96 | <!-- *********************************************************************** --> |
| 97 | |
| 98 | <div> |
| 99 | |
| 100 | <p>In order to achieve a balance between performance and necessary guarantees, |
| 101 | there are six levels of atomicity. They are listed in order of strength; |
| 102 | each level includes all the guarantees of the previous level except for |
| 103 | Acquire/Release.</p> |
| 104 | |
| 105 | <p>Unordered is the lowest level of atomicity. It essentially guarantees that |
| 106 | races produce somewhat sane results instead of having undefined behavior. |
| 107 | This is intended to match the Java memory model for shared variables. It |
| 108 | cannot be used for synchronization, but is useful for Java and other |
| 109 | "safe" languages which need to guarantee that the generated code never |
| 110 | exhibits undefined behavior. Note that this guarantee is cheap on common |
| 111 | platforms for loads of a native width, but can be expensive or unavailable |
| 112 | for wider loads, like a 64-bit load on ARM. (A frontend for a "safe" |
| 113 | language would normally split a 64-bit load on ARM into two 32-bit |
| 114 | unordered loads.) In terms of the optimizer, this prohibits any |
| 115 | transformation that transforms a single load into multiple loads, |
| 116 | transforms a store into multiple stores, narrows a store, or stores a |
| 117 | value which would not be stored otherwise. Some examples of unsafe |
| 118 | optimizations are narrowing an assignment into a bitfield, rematerializing |
| 119 | a load, and turning loads and stores into a memcpy call. Reordering |
| 120 | unordered operations is safe, though, and optimizers should take |
| 121 | advantage of that because unordered operations are common in |
| 122 | languages that need them.</p> |
| 123 | |
| 124 | <p>Monotonic is the weakest level of atomicity that can be used in |
| 125 | synchronization primitives, although it does not provide any general |
| 126 | synchronization. It essentially guarantees that if you take all the |
| 127 | operations affecting a specific address, a consistent ordering exists. |
| 128 | This corresponds to the C++0x/C1x <code>memory_order_relaxed</code>; see |
| 129 | those standards for the exact definition. If you are writing a frontend, do |
| 130 | not use the low-level synchronization primitives unless you are compiling |
| 131 | a language which requires it or are sure a given pattern is correct. In |
| 132 | terms of the optimizer, this can be treated as a read+write on the relevant |
| 133 | memory location (and alias analysis will take advantage of that). In |
| 134 | addition, it is legal to reorder non-atomic and Unordered loads around |
| 135 | Monotonic loads. CSE/DSE and a few other optimizations are allowed, but |
| 136 | Monotonic operations are unlikely to be used in ways which would make |
| 137 | those optimizations useful.</p> |
| 138 | |
| 139 | <p>Acquire provides a barrier of the sort necessary to acquire a lock to access |
| 140 | other memory with normal loads and stores. This corresponds to the |
| 141 | C++0x/C1x <code>memory_order_acquire</code>. This is a low-level |
| 142 | synchronization primitive. In general, optimizers should treat this like |
| 143 | a nothrow call.</p> |
| 144 | |
| 145 | <p>Release is similar to Acquire, but with a barrier of the sort necessary to |
| 146 | release a lock.This corresponds to the C++0x/C1x |
| 147 | <code>memory_order_release</code>.</p> |
| 148 | |
| 149 | <p>AcquireRelease (<code>acq_rel</code> in IR) provides both an Acquire and a Release barrier. |
| 150 | This corresponds to the C++0x/C1x <code>memory_order_acq_rel</code>. In general, |
| 151 | optimizers should treat this like a nothrow call.</p> |
| 152 | |
| 153 | <p>SequentiallyConsistent (<code>seq_cst</code> in IR) provides Acquire and/or |
| 154 | Release semantics, and in addition guarantees a total ordering exists with |
| 155 | all other SequentiallyConsistent operations. This corresponds to the |
| 156 | C++0x/C1x <code>memory_order_seq_cst</code>, and Java volatile. The intent |
| 157 | of this ordering level is to provide a programming model which is relatively |
| 158 | easy to understand. In general, optimizers should treat this like a |
| 159 | nothrow call.</p> |
| 160 | |
| 161 | </div> |
| 162 | |
| 163 | <!-- *********************************************************************** --> |
| 164 | <h2> |
| 165 | <a name="otherinst">Other atomic instructions</a> |
| 166 | </h2> |
| 167 | <!-- *********************************************************************** --> |
| 168 | |
| 169 | <div> |
| 170 | |
| 171 | <p><code>cmpxchg</code> and <code>atomicrmw</code> are essentially like an |
| 172 | atomic load followed by an atomic store (where the store is conditional for |
| 173 | <code>cmpxchg</code>), but no other memory operation operation can happen |
| 174 | between the load and store.</p> |
| 175 | |
| 176 | <p>A <code>fence</code> provides Acquire and/or Release ordering which is not |
| 177 | part of another operation; it is normally used along with Monotonic memory |
| 178 | operations. A Monotonic load followed by an Acquire fence is roughly |
| 179 | equivalent to an Acquire load.</p> |
| 180 | |
| 181 | <p>Frontends generating atomic instructions generally need to be aware of the |
| 182 | target to some degree; atomic instructions are guaranteed to be lock-free, |
| 183 | and therefore an instruction which is wider than the target natively supports |
| 184 | can be impossible to generate.</p> |
| 185 | |
| 186 | </div> |
| 187 | |
| 188 | <!-- *********************************************************************** --> |
| 189 | <h2> |
| 190 | <a name="iropt">Atomics and IR optimization</a> |
| 191 | </h2> |
| 192 | <!-- *********************************************************************** --> |
| 193 | |
| 194 | <div> |
| 195 | |
| 196 | <p>Predicates for optimizer writers to query: |
| 197 | <ul> |
| 198 | <li>isSimple(): A load or store which is not volatile or atomic. This is |
| 199 | what, for example, memcpyopt would check for operations it might |
| 200 | transform. |
| 201 | <li>isUnordered(): A load or store which is not volatile and at most |
| 202 | Unordered. This would be checked, for example, by LICM before hoisting |
| 203 | an operation. |
| 204 | <li>mayReadFromMemory()/mayWriteToMemory(): Existing predicate, but note |
| 205 | that they returns true for any operation which is volatile or at least |
| 206 | Monotonic. |
| 207 | <li>Alias analysis: Note that AA will return ModRef for anything Acquire or |
| 208 | Release, and for the address accessed by any Monotonic operation. |
| 209 | </ul> |
| 210 | |
| 211 | <p>There are essentially two components to supporting atomic operations. The |
| 212 | first is making sure to query isSimple() or isUnordered() instead |
| 213 | of isVolatile() before transforming an operation. The other piece is |
| 214 | making sure that a transform does not end up replacing, for example, an |
| 215 | Unordered operation with a non-atomic operation. Most of the other |
| 216 | necessary checks automatically fall out from existing predicates and |
| 217 | alias analysis queries.</p> |
| 218 | |
| 219 | <p>Some examples of how optimizations interact with various kinds of atomic |
| 220 | operations: |
| 221 | <ul> |
| 222 | <li>memcpyopt: An atomic operation cannot be optimized into part of a |
| 223 | memcpy/memset, including unordered loads/stores. It can pull operations |
| 224 | across some atomic operations. |
| 225 | <li>LICM: Unordered loads/stores can be moved out of a loop. It just treats |
| 226 | monotonic operations like a read+write to a memory location, and anything |
| 227 | stricter than that like a nothrow call. |
| 228 | <li>DSE: Unordered stores can be DSE'ed like normal stores. Monotonic stores |
| 229 | can be DSE'ed in some cases, but it's tricky to reason about, and not |
| 230 | especially important. |
| 231 | <li>Folding a load: Any atomic load from a constant global can be |
| 232 | constant-folded, because it cannot be observed. Similar reasoning allows |
| 233 | scalarrepl with atomic loads and stores. |
| 234 | </ul> |
| 235 | |
| 236 | </div> |
| 237 | |
| 238 | <!-- *********************************************************************** --> |
| 239 | <h2> |
| 240 | <a name="codegen">Atomics and Codegen</a> |
| 241 | </h2> |
| 242 | <!-- *********************************************************************** --> |
| 243 | |
| 244 | <div> |
| 245 | |
| 246 | <p>Atomic operations are represented in the SelectionDAG with |
| 247 | <code>ATOMIC_*</code> opcodes. On architectures which use barrier |
| 248 | instructions for all atomic ordering (like ARM), appropriate fences are |
| 249 | split out as the DAG is built.</p> |
| 250 | |
| 251 | <p>The MachineMemOperand for all atomic operations is currently marked as |
| 252 | volatile; this is not correct in the IR sense of volatile, but CodeGen |
| 253 | handles anything marked volatile very conservatively. This should get |
| 254 | fixed at some point.</p> |
| 255 | |
| 256 | <p>The implementation of atomics on LL/SC architectures (like ARM) is currently |
| 257 | a bit of a mess; there is a lot of copy-pasted code across targets, and |
| 258 | the representation is relatively unsuited to optimization (it would be nice |
| 259 | to be able to optimize loops involving cmpxchg etc.).</p> |
| 260 | |
| 261 | <p>On x86, all atomic loads generate a <code>MOV</code>. |
| 262 | SequentiallyConsistent stores generate an <code>XCHG</code>, other stores |
| 263 | generate a <code>MOV</code>. SequentiallyConsistent fences generate an |
| 264 | <code>MFENCE</code>, other fences do not cause any code to be generated. |
| 265 | cmpxchg uses the <code>LOCK CMPXCHG</code> instruction. |
| 266 | <code>atomicrmw xchg</code> uses <code>XCHG</code>, |
| 267 | <code>atomicrmw add</code> and <code>atomicrmw sub</code> use |
| 268 | <code>XADD</code>, and all other <code>atomicrmw</code> operations generate |
| 269 | a loop with <code>LOCK CMPXCHG</code>. Depending on the users of the |
| 270 | result, some <code>atomicrmw</code> operations can be translated into |
| 271 | operations like <code>LOCK AND</code>, but that does not work in |
| 272 | general.</p> |
| 273 | |
| 274 | <p>On ARM, MIPS, and many other RISC architectures, Acquire, Release, and |
| 275 | SequentiallyConsistent semantics require barrier instructions |
| 276 | for every such operation. Loads and stores generate normal instructions. |
| 277 | <code>atomicrmw</code> and <code>cmpxchg</code> generate LL/SC loops.</p> |
| 278 | |
| 279 | </div> |
| 280 | |
| 281 | <!-- *********************************************************************** --> |
| 282 | |
| 283 | <hr> |
| 284 | <address> |
| 285 | <a href="http://jigsaw.w3.org/css-validator/check/referer"><img |
| 286 | src="http://jigsaw.w3.org/css-validator/images/vcss-blue" alt="Valid CSS"></a> |
| 287 | <a href="http://validator.w3.org/check/referer"><img |
| 288 | src="http://www.w3.org/Icons/valid-html401-blue" alt="Valid HTML 4.01"></a> |
| 289 | |
| 290 | <a href="http://llvm.org/">LLVM Compiler Infrastructure</a><br> |
| 291 | Last modified: $Date: 2011-08-09 02:07:00 -0700 (Tue, 09 Aug 2011) $ |
| 292 | </address> |
| 293 | |
| 294 | </body> |
| 295 | </html> |