Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" |
| 2 | "http://www.w3.org/TR/html4/strict.dtd"> |
| 3 | <html> |
| 4 | <head> |
| 5 | <title>LLVM Atomic Instructions and Concurrency Guide</title> |
Eli Friedman | 21006d4 | 2011-08-09 23:02:53 +0000 | [diff] [blame] | 6 | <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 7 | <link rel="stylesheet" href="llvm.css" type="text/css"> |
| 8 | </head> |
| 9 | <body> |
| 10 | |
| 11 | <h1> |
| 12 | LLVM Atomic Instructions and Concurrency Guide |
| 13 | </h1> |
| 14 | |
| 15 | <ol> |
| 16 | <li><a href="#introduction">Introduction</a></li> |
Eli Friedman | 91a44dd | 2011-08-12 21:50:54 +0000 | [diff] [blame] | 17 | <li><a href="#outsideatomic">Optimization outside atomic</a></li> |
| 18 | <li><a href="#atomicinst">Atomic instructions</a></li> |
Eli Friedman | 1bf4ad4 | 2011-08-11 23:44:25 +0000 | [diff] [blame] | 19 | <li><a href="#ordering">Atomic orderings</a></li> |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 20 | <li><a href="#iropt">Atomics and IR optimization</a></li> |
| 21 | <li><a href="#codegen">Atomics and Codegen</a></li> |
| 22 | </ol> |
| 23 | |
| 24 | <div class="doc_author"> |
| 25 | <p>Written by Eli Friedman</p> |
| 26 | </div> |
| 27 | |
| 28 | <!-- *********************************************************************** --> |
| 29 | <h2> |
| 30 | <a name="introduction">Introduction</a> |
| 31 | </h2> |
| 32 | <!-- *********************************************************************** --> |
| 33 | |
| 34 | <div> |
| 35 | |
| 36 | <p>Historically, LLVM has not had very strong support for concurrency; some |
| 37 | minimal intrinsics were provided, and <code>volatile</code> was used in some |
| 38 | cases to achieve rough semantics in the presence of concurrency. However, this |
| 39 | is changing; there are now new instructions which are well-defined in the |
| 40 | presence of threads and asynchronous signals, and the model for existing |
| 41 | instructions has been clarified in the IR.</p> |
| 42 | |
| 43 | <p>The atomic instructions are designed specifically to provide readable IR and |
| 44 | optimized code generation for the following:</p> |
| 45 | <ul> |
Eli Friedman | 1bf4ad4 | 2011-08-11 23:44:25 +0000 | [diff] [blame] | 46 | <li>The new C++0x <code><atomic></code> header. |
| 47 | (<a href="http://www.open-std.org/jtc1/sc22/wg21/">C++0x draft available here</a>.) |
| 48 | (<a href="http://www.open-std.org/jtc1/sc22/wg14/">C1x draft available here</a>)</li> |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 49 | <li>Proper semantics for Java-style memory, for both <code>volatile</code> and |
Eli Friedman | 1bf4ad4 | 2011-08-11 23:44:25 +0000 | [diff] [blame] | 50 | regular shared variables. |
| 51 | (<a href="http://java.sun.com/docs/books/jls/third_edition/html/memory.html">Java Specification</a>)</li> |
| 52 | <li>gcc-compatible <code>__sync_*</code> builtins. |
| 53 | (<a href="http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html">Description</a>)</li> |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 54 | <li>Other scenarios with atomic semantics, including <code>static</code> |
| 55 | variables with non-trivial constructors in C++.</li> |
| 56 | </ul> |
| 57 | |
Eli Friedman | 1bf4ad4 | 2011-08-11 23:44:25 +0000 | [diff] [blame] | 58 | <p>Atomic and volatile in the IR are orthogonal; "volatile" is the C/C++ |
| 59 | volatile, which ensures that every volatile load and store happens and is |
| 60 | performed in the stated order. A couple examples: if a |
| 61 | SequentiallyConsistent store is immediately followed by another |
| 62 | SequentiallyConsistent store to the same address, the first store can |
| 63 | be erased. This transformation is not allowed for a pair of volatile |
| 64 | stores. On the other hand, a non-volatile non-atomic load can be moved |
| 65 | across a volatile load freely, but not an Acquire load.</p> |
| 66 | |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 67 | <p>This document is intended to provide a guide to anyone either writing a |
| 68 | frontend for LLVM or working on optimization passes for LLVM with a guide |
| 69 | for how to deal with instructions with special semantics in the presence of |
| 70 | concurrency. This is not intended to be a precise guide to the semantics; |
| 71 | the details can get extremely complicated and unreadable, and are not |
| 72 | usually necessary.</p> |
| 73 | |
| 74 | </div> |
| 75 | |
| 76 | <!-- *********************************************************************** --> |
| 77 | <h2> |
Eli Friedman | 91a44dd | 2011-08-12 21:50:54 +0000 | [diff] [blame] | 78 | <a name="outsideatomic">Optimization outside atomic</a> |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 79 | </h2> |
| 80 | <!-- *********************************************************************** --> |
| 81 | |
| 82 | <div> |
| 83 | |
| 84 | <p>The basic <code>'load'</code> and <code>'store'</code> allow a variety of |
Eli Friedman | 91a44dd | 2011-08-12 21:50:54 +0000 | [diff] [blame] | 85 | optimizations, but can lead to undefined results in a concurrent environment; |
| 86 | see <a href="#o_nonatomic">NonAtomic</a>. This section specifically goes |
| 87 | into the one optimizer restriction which applies in concurrent environments, |
| 88 | which gets a bit more of an extended description because any optimization |
| 89 | dealing with stores needs to be aware of it.</p> |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 90 | |
| 91 | <p>From the optimizer's point of view, the rule is that if there |
Eli Friedman | 1bf4ad4 | 2011-08-11 23:44:25 +0000 | [diff] [blame] | 92 | are not any instructions with atomic ordering involved, concurrency does |
| 93 | not matter, with one exception: if a variable might be visible to another |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 94 | thread or signal handler, a store cannot be inserted along a path where it |
Eli Friedman | 91a44dd | 2011-08-12 21:50:54 +0000 | [diff] [blame] | 95 | might not execute otherwise. Take the following example:</p> |
| 96 | |
| 97 | <pre> |
| 98 | /* C code, for readability; run through clang -O2 -S -emit-llvm to get |
| 99 | equivalent IR */ |
| 100 | int x; |
| 101 | void f(int* a) { |
| 102 | for (int i = 0; i < 100; i++) { |
| 103 | if (a[i]) |
| 104 | x += 1; |
| 105 | } |
| 106 | } |
| 107 | </pre> |
| 108 | |
| 109 | <p>The following is equivalent in non-concurrent situations:</p> |
| 110 | |
| 111 | <pre> |
| 112 | int x; |
| 113 | void f(int* a) { |
| 114 | int xtemp = x; |
| 115 | for (int i = 0; i < 100; i++) { |
| 116 | if (a[i]) |
| 117 | xtemp += 1; |
| 118 | } |
| 119 | x = xtemp; |
| 120 | } |
| 121 | </pre> |
| 122 | |
| 123 | <p>However, LLVM is not allowed to transform the former to the latter: it could |
Eli Friedman | 234bccd | 2011-08-22 21:35:27 +0000 | [diff] [blame] | 124 | indirectly introduce undefined behavior if another thread can access x at |
| 125 | the same time. (This example is particularly of interest because before the |
| 126 | concurrency model was implemented, LLVM would perform this |
| 127 | transformation.)</p> |
Eli Friedman | 91a44dd | 2011-08-12 21:50:54 +0000 | [diff] [blame] | 128 | |
| 129 | <p>Note that speculative loads are allowed; a load which |
Eli Friedman | 1bf4ad4 | 2011-08-11 23:44:25 +0000 | [diff] [blame] | 130 | is part of a race returns <code>undef</code>, but does not have undefined |
| 131 | behavior.</p> |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 132 | |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 133 | |
| 134 | </div> |
| 135 | |
| 136 | <!-- *********************************************************************** --> |
| 137 | <h2> |
Eli Friedman | 91a44dd | 2011-08-12 21:50:54 +0000 | [diff] [blame] | 138 | <a name="atomicinst">Atomic instructions</a> |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 139 | </h2> |
| 140 | <!-- *********************************************************************** --> |
| 141 | |
| 142 | <div> |
| 143 | |
Eli Friedman | 91a44dd | 2011-08-12 21:50:54 +0000 | [diff] [blame] | 144 | <p>For cases where simple loads and stores are not sufficient, LLVM provides |
| 145 | various atomic instructions. The exact guarantees provided depend on the |
| 146 | ordering; see <a href="#ordering">Atomic orderings</a></p> |
| 147 | |
| 148 | <p><code>load atomic</code> and <code>store atomic</code> provide the same |
| 149 | basic functionality as non-atomic loads and stores, but provide additional |
| 150 | guarantees in situations where threads and signals are involved.</p> |
| 151 | |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 152 | <p><code>cmpxchg</code> and <code>atomicrmw</code> are essentially like an |
| 153 | atomic load followed by an atomic store (where the store is conditional for |
Eli Friedman | 91a44dd | 2011-08-12 21:50:54 +0000 | [diff] [blame] | 154 | <code>cmpxchg</code>), but no other memory operation can happen on any thread |
| 155 | between the load and store. Note that LLVM's cmpxchg does not provide quite |
| 156 | as many options as the C++0x version.</p> |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 157 | |
| 158 | <p>A <code>fence</code> provides Acquire and/or Release ordering which is not |
| 159 | part of another operation; it is normally used along with Monotonic memory |
| 160 | operations. A Monotonic load followed by an Acquire fence is roughly |
| 161 | equivalent to an Acquire load.</p> |
| 162 | |
| 163 | <p>Frontends generating atomic instructions generally need to be aware of the |
| 164 | target to some degree; atomic instructions are guaranteed to be lock-free, |
| 165 | and therefore an instruction which is wider than the target natively supports |
| 166 | can be impossible to generate.</p> |
| 167 | |
| 168 | </div> |
| 169 | |
| 170 | <!-- *********************************************************************** --> |
| 171 | <h2> |
Eli Friedman | 1bf4ad4 | 2011-08-11 23:44:25 +0000 | [diff] [blame] | 172 | <a name="ordering">Atomic orderings</a> |
| 173 | </h2> |
| 174 | <!-- *********************************************************************** --> |
| 175 | |
| 176 | <div> |
| 177 | |
| 178 | <p>In order to achieve a balance between performance and necessary guarantees, |
| 179 | there are six levels of atomicity. They are listed in order of strength; |
| 180 | each level includes all the guarantees of the previous level except for |
Eli Friedman | 234bccd | 2011-08-22 21:35:27 +0000 | [diff] [blame] | 181 | Acquire/Release. (See also <a href="LangRef.html#ordering">LangRef</a>.)</p> |
Eli Friedman | 1bf4ad4 | 2011-08-11 23:44:25 +0000 | [diff] [blame] | 182 | |
| 183 | <!-- ======================================================================= --> |
| 184 | <h3> |
Eli Friedman | 91a44dd | 2011-08-12 21:50:54 +0000 | [diff] [blame] | 185 | <a name="o_notatomic">NotAtomic</a> |
| 186 | </h3> |
| 187 | |
| 188 | <div> |
| 189 | |
| 190 | <p>NotAtomic is the obvious, a load or store which is not atomic. (This isn't |
| 191 | really a level of atomicity, but is listed here for comparison.) This is |
Eli Friedman | 234bccd | 2011-08-22 21:35:27 +0000 | [diff] [blame] | 192 | essentially a regular load or store. If there is a race on a given memory |
| 193 | location, loads from that location return undef.</p> |
Eli Friedman | 91a44dd | 2011-08-12 21:50:54 +0000 | [diff] [blame] | 194 | |
| 195 | <dl> |
| 196 | <dt>Relevant standard</dt> |
| 197 | <dd>This is intended to match shared variables in C/C++, and to be used |
| 198 | in any other context where memory access is necessary, and |
Eli Friedman | 234bccd | 2011-08-22 21:35:27 +0000 | [diff] [blame] | 199 | a race is impossible. (The precise definition is in |
| 200 | <a href="LangRef.html#memmodel">LangRef</a>.) |
Eli Friedman | 91a44dd | 2011-08-12 21:50:54 +0000 | [diff] [blame] | 201 | <dt>Notes for frontends</dt> |
| 202 | <dd>The rule is essentially that all memory accessed with basic loads and |
| 203 | stores by multiple threads should be protected by a lock or other |
| 204 | synchronization; otherwise, you are likely to run into undefined |
| 205 | behavior. If your frontend is for a "safe" language like Java, |
| 206 | use Unordered to load and store any shared variable. Note that NotAtomic |
| 207 | volatile loads and stores are not properly atomic; do not try to use |
| 208 | them as a substitute. (Per the C/C++ standards, volatile does provide |
| 209 | some limited guarantees around asynchronous signals, but atomics are |
| 210 | generally a better solution.) |
| 211 | <dt>Notes for optimizers</dt> |
| 212 | <dd>Introducing loads to shared variables along a codepath where they would |
| 213 | not otherwise exist is allowed; introducing stores to shared variables |
| 214 | is not. See <a href="#outsideatomic">Optimization outside |
| 215 | atomic</a>.</dd> |
| 216 | <dt>Notes for code generation</dt> |
| 217 | <dd>The one interesting restriction here is that it is not allowed to write |
| 218 | to bytes outside of the bytes relevant to a store. This is mostly |
| 219 | relevant to unaligned stores: it is not allowed in general to convert |
| 220 | an unaligned store into two aligned stores of the same width as the |
| 221 | unaligned store. Backends are also expected to generate an i8 store |
| 222 | as an i8 store, and not an instruction which writes to surrounding |
| 223 | bytes. (If you are writing a backend for an architecture which cannot |
| 224 | satisfy these restrictions and cares about concurrency, please send an |
| 225 | email to llvmdev.)</dd> |
| 226 | </dl> |
| 227 | |
| 228 | </div> |
| 229 | |
| 230 | |
| 231 | <!-- ======================================================================= --> |
| 232 | <h3> |
Eli Friedman | 1bf4ad4 | 2011-08-11 23:44:25 +0000 | [diff] [blame] | 233 | <a name="o_unordered">Unordered</a> |
| 234 | </h3> |
| 235 | |
| 236 | <div> |
| 237 | |
| 238 | <p>Unordered is the lowest level of atomicity. It essentially guarantees that |
| 239 | races produce somewhat sane results instead of having undefined behavior. |
| 240 | It also guarantees the operation to be lock-free, so it do not depend on |
| 241 | the data being part of a special atomic structure or depend on a separate |
| 242 | per-process global lock. Note that code generation will fail for |
| 243 | unsupported atomic operations; if you need such an operation, use explicit |
| 244 | locking.</p> |
| 245 | |
| 246 | <dl> |
| 247 | <dt>Relevant standard</dt> |
| 248 | <dd>This is intended to match the Java memory model for shared |
| 249 | variables.</dd> |
| 250 | <dt>Notes for frontends</dt> |
| 251 | <dd>This cannot be used for synchronization, but is useful for Java and |
| 252 | other "safe" languages which need to guarantee that the generated |
| 253 | code never exhibits undefined behavior. Note that this guarantee |
| 254 | is cheap on common platforms for loads of a native width, but can |
| 255 | be expensive or unavailable for wider loads, like a 64-bit store |
| 256 | on ARM. (A frontend for Java or other "safe" languages would normally |
| 257 | split a 64-bit store on ARM into two 32-bit unordered stores.) |
| 258 | <dt>Notes for optimizers</dt> |
| 259 | <dd>In terms of the optimizer, this prohibits any transformation that |
| 260 | transforms a single load into multiple loads, transforms a store |
| 261 | into multiple stores, narrows a store, or stores a value which |
| 262 | would not be stored otherwise. Some examples of unsafe optimizations |
| 263 | are narrowing an assignment into a bitfield, rematerializing |
| 264 | a load, and turning loads and stores into a memcpy call. Reordering |
| 265 | unordered operations is safe, though, and optimizers should take |
| 266 | advantage of that because unordered operations are common in |
| 267 | languages that need them.</dd> |
| 268 | <dt>Notes for code generation</dt> |
| 269 | <dd>These operations are required to be atomic in the sense that if you |
| 270 | use unordered loads and unordered stores, a load cannot see a value |
| 271 | which was never stored. A normal load or store instruction is usually |
| 272 | sufficient, but note that an unordered load or store cannot |
| 273 | be split into multiple instructions (or an instruction which |
| 274 | does multiple memory operations, like <code>LDRD</code> on ARM).</dd> |
| 275 | </dl> |
| 276 | |
| 277 | </div> |
| 278 | |
| 279 | <!-- ======================================================================= --> |
| 280 | <h3> |
| 281 | <a name="o_monotonic">Monotonic</a> |
| 282 | </h3> |
| 283 | |
| 284 | <div> |
| 285 | |
| 286 | <p>Monotonic is the weakest level of atomicity that can be used in |
| 287 | synchronization primitives, although it does not provide any general |
| 288 | synchronization. It essentially guarantees that if you take all the |
| 289 | operations affecting a specific address, a consistent ordering exists. |
| 290 | |
| 291 | <dl> |
| 292 | <dt>Relevant standard</dt> |
| 293 | <dd>This corresponds to the C++0x/C1x <code>memory_order_relaxed</code>; |
| 294 | see those standards for the exact definition. |
| 295 | <dt>Notes for frontends</dt> |
| 296 | <dd>If you are writing a frontend which uses this directly, use with caution. |
| 297 | The guarantees in terms of synchronization are very weak, so make |
| 298 | sure these are only used in a pattern which you know is correct. |
| 299 | Generally, these would either be used for atomic operations which |
| 300 | do not protect other memory (like an atomic counter), or along with |
| 301 | a <code>fence</code>.</dd> |
| 302 | <dt>Notes for optimizers</dt> |
| 303 | <dd>In terms of the optimizer, this can be treated as a read+write on the |
| 304 | relevant memory location (and alias analysis will take advantage of |
| 305 | that). In addition, it is legal to reorder non-atomic and Unordered |
| 306 | loads around Monotonic loads. CSE/DSE and a few other optimizations |
| 307 | are allowed, but Monotonic operations are unlikely to be used in ways |
| 308 | which would make those optimizations useful.</dd> |
| 309 | <dt>Notes for code generation</dt> |
| 310 | <dd>Code generation is essentially the same as that for unordered for loads |
Nick Lewycky | 5366ca4 | 2011-09-11 15:30:08 +0000 | [diff] [blame] | 311 | and stores. No fences are required. <code>cmpxchg</code> and |
Eli Friedman | 1bf4ad4 | 2011-08-11 23:44:25 +0000 | [diff] [blame] | 312 | <code>atomicrmw</code> are required to appear as a single operation.</dd> |
| 313 | </dl> |
| 314 | |
| 315 | </div> |
| 316 | |
| 317 | <!-- ======================================================================= --> |
| 318 | <h3> |
| 319 | <a name="o_acquire">Acquire</a> |
| 320 | </h3> |
| 321 | |
| 322 | <div> |
| 323 | |
| 324 | <p>Acquire provides a barrier of the sort necessary to acquire a lock to access |
| 325 | other memory with normal loads and stores. |
| 326 | |
| 327 | <dl> |
| 328 | <dt>Relevant standard</dt> |
| 329 | <dd>This corresponds to the C++0x/C1x <code>memory_order_acquire</code>. It |
| 330 | should also be used for C++0x/C1x <code>memory_order_consume</code>. |
| 331 | <dt>Notes for frontends</dt> |
| 332 | <dd>If you are writing a frontend which uses this directly, use with caution. |
| 333 | Acquire only provides a semantic guarantee when paired with a Release |
| 334 | operation.</dd> |
| 335 | <dt>Notes for optimizers</dt> |
Eli Friedman | 79d7de7 | 2011-08-12 03:38:32 +0000 | [diff] [blame] | 336 | <dd>Optimizers not aware of atomics can treat this like a nothrow call. |
Chris Lattner | 9a5ffbf | 2011-08-12 19:48:19 +0000 | [diff] [blame] | 337 | It is also possible to move stores from before an Acquire load |
Eli Friedman | 79d7de7 | 2011-08-12 03:38:32 +0000 | [diff] [blame] | 338 | or read-modify-write operation to after it, and move non-Acquire |
| 339 | loads from before an Acquire operation to after it.</dd> |
Eli Friedman | 1bf4ad4 | 2011-08-11 23:44:25 +0000 | [diff] [blame] | 340 | <dt>Notes for code generation</dt> |
| 341 | <dd>Architectures with weak memory ordering (essentially everything relevant |
| 342 | today except x86 and SPARC) require some sort of fence to maintain |
| 343 | the Acquire semantics. The precise fences required varies widely by |
| 344 | architecture, but for a simple implementation, most architectures provide |
| 345 | a barrier which is strong enough for everything (<code>dmb</code> on ARM, |
| 346 | <code>sync</code> on PowerPC, etc.). Putting such a fence after the |
| 347 | equivalent Monotonic operation is sufficient to maintain Acquire |
| 348 | semantics for a memory operation.</dd> |
| 349 | </dl> |
| 350 | |
| 351 | </div> |
| 352 | |
| 353 | <!-- ======================================================================= --> |
| 354 | <h3> |
| 355 | <a name="o_acquire">Release</a> |
| 356 | </h3> |
| 357 | |
| 358 | <div> |
| 359 | |
| 360 | <p>Release is similar to Acquire, but with a barrier of the sort necessary to |
| 361 | release a lock. |
| 362 | |
| 363 | <dl> |
| 364 | <dt>Relevant standard</dt> |
| 365 | <dd>This corresponds to the C++0x/C1x <code>memory_order_release</code>.</dd> |
| 366 | <dt>Notes for frontends</dt> |
| 367 | <dd>If you are writing a frontend which uses this directly, use with caution. |
| 368 | Release only provides a semantic guarantee when paired with a Acquire |
| 369 | operation.</dd> |
| 370 | <dt>Notes for optimizers</dt> |
Eli Friedman | 79d7de7 | 2011-08-12 03:38:32 +0000 | [diff] [blame] | 371 | <dd>Optimizers not aware of atomics can treat this like a nothrow call. |
| 372 | It is also possible to move loads from after a Release store |
| 373 | or read-modify-write operation to before it, and move non-Release |
| 374 | stores from after an Release operation to before it.</dd> |
Eli Friedman | 1bf4ad4 | 2011-08-11 23:44:25 +0000 | [diff] [blame] | 375 | <dt>Notes for code generation</dt> |
Eli Friedman | d577a06 | 2011-08-12 01:26:06 +0000 | [diff] [blame] | 376 | <dd>See the section on Acquire; a fence before the relevant operation is |
Eli Friedman | 79d7de7 | 2011-08-12 03:38:32 +0000 | [diff] [blame] | 377 | usually sufficient for Release. Note that a store-store fence is not |
Eli Friedman | d577a06 | 2011-08-12 01:26:06 +0000 | [diff] [blame] | 378 | sufficient to implement Release semantics; store-store fences are |
| 379 | generally not exposed to IR because they are extremely difficult to |
Eli Friedman | 1bf4ad4 | 2011-08-11 23:44:25 +0000 | [diff] [blame] | 380 | use correctly.</dd> |
| 381 | </dl> |
| 382 | |
| 383 | </div> |
| 384 | |
| 385 | <!-- ======================================================================= --> |
| 386 | <h3> |
| 387 | <a name="o_acqrel">AcquireRelease</a> |
| 388 | </h3> |
| 389 | |
| 390 | <div> |
| 391 | |
| 392 | <p>AcquireRelease (<code>acq_rel</code> in IR) provides both an Acquire and a |
| 393 | Release barrier (for fences and operations which both read and write memory). |
| 394 | |
| 395 | <dl> |
| 396 | <dt>Relevant standard</dt> |
| 397 | <dd>This corresponds to the C++0x/C1x <code>memory_order_acq_rel</code>. |
| 398 | <dt>Notes for frontends</dt> |
| 399 | <dd>If you are writing a frontend which uses this directly, use with caution. |
| 400 | Acquire only provides a semantic guarantee when paired with a Release |
| 401 | operation, and vice versa.</dd> |
| 402 | <dt>Notes for optimizers</dt> |
| 403 | <dd>In general, optimizers should treat this like a nothrow call; the |
| 404 | the possible optimizations are usually not interesting.</dd> |
| 405 | <dt>Notes for code generation</dt> |
| 406 | <dd>This operation has Acquire and Release semantics; see the sections on |
Eli Friedman | 5093fe6 | 2011-08-11 23:48:52 +0000 | [diff] [blame] | 407 | Acquire and Release.</dd> |
Eli Friedman | 1bf4ad4 | 2011-08-11 23:44:25 +0000 | [diff] [blame] | 408 | </dl> |
| 409 | |
| 410 | </div> |
| 411 | |
| 412 | <!-- ======================================================================= --> |
| 413 | <h3> |
| 414 | <a name="o_seqcst">SequentiallyConsistent</a> |
| 415 | </h3> |
| 416 | |
| 417 | <div> |
| 418 | |
Andrew Trick | a1b953b | 2011-08-12 00:36:38 +0000 | [diff] [blame] | 419 | <p>SequentiallyConsistent (<code>seq_cst</code> in IR) provides |
| 420 | Acquire semantics for loads and Release semantics for |
| 421 | stores. Additionally, it guarantees that a total ordering exists |
| 422 | between all SequentiallyConsistent operations. |
Eli Friedman | 1bf4ad4 | 2011-08-11 23:44:25 +0000 | [diff] [blame] | 423 | |
| 424 | <dl> |
| 425 | <dt>Relevant standard</dt> |
| 426 | <dd>This corresponds to the C++0x/C1x <code>memory_order_seq_cst</code>, |
| 427 | Java volatile, and the gcc-compatible <code>__sync_*</code> builtins |
| 428 | which do not specify otherwise. |
| 429 | <dt>Notes for frontends</dt> |
| 430 | <dd>If a frontend is exposing atomic operations, these are much easier to |
| 431 | reason about for the programmer than other kinds of operations, and using |
| 432 | them is generally a practical performance tradeoff.</dd> |
| 433 | <dt>Notes for optimizers</dt> |
Eli Friedman | 79d7de7 | 2011-08-12 03:38:32 +0000 | [diff] [blame] | 434 | <dd>Optimizers not aware of atomics can treat this like a nothrow call. |
| 435 | For SequentiallyConsistent loads and stores, the same reorderings are |
| 436 | allowed as for Acquire loads and Release stores, except that |
| 437 | SequentiallyConsistent operations may not be reordered.</dd> |
Eli Friedman | 1bf4ad4 | 2011-08-11 23:44:25 +0000 | [diff] [blame] | 438 | <dt>Notes for code generation</dt> |
Andrew Trick | a1b953b | 2011-08-12 00:36:38 +0000 | [diff] [blame] | 439 | <dd>SequentiallyConsistent loads minimally require the same barriers |
Nick Lewycky | e4481d8 | 2011-09-11 15:50:05 +0000 | [diff] [blame] | 440 | as Acquire operations and SequentiallyConsistent stores require |
Eli Friedman | 79d7de7 | 2011-08-12 03:38:32 +0000 | [diff] [blame] | 441 | Release barriers. Additionally, the code generator must enforce |
Nick Lewycky | e4481d8 | 2011-09-11 15:50:05 +0000 | [diff] [blame] | 442 | ordering between SequentiallyConsistent stores followed by |
| 443 | SequentiallyConsistent loads. This is usually done by emitting |
Eli Friedman | 79d7de7 | 2011-08-12 03:38:32 +0000 | [diff] [blame] | 444 | either a full fence before the loads or a full fence after the |
| 445 | stores; which is preferred varies by architecture.</dd> |
Eli Friedman | 1bf4ad4 | 2011-08-11 23:44:25 +0000 | [diff] [blame] | 446 | </dl> |
| 447 | |
| 448 | </div> |
| 449 | |
| 450 | </div> |
| 451 | |
| 452 | <!-- *********************************************************************** --> |
| 453 | <h2> |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 454 | <a name="iropt">Atomics and IR optimization</a> |
| 455 | </h2> |
| 456 | <!-- *********************************************************************** --> |
| 457 | |
| 458 | <div> |
| 459 | |
| 460 | <p>Predicates for optimizer writers to query: |
| 461 | <ul> |
| 462 | <li>isSimple(): A load or store which is not volatile or atomic. This is |
| 463 | what, for example, memcpyopt would check for operations it might |
Eli Friedman | 91a44dd | 2011-08-12 21:50:54 +0000 | [diff] [blame] | 464 | transform.</li> |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 465 | <li>isUnordered(): A load or store which is not volatile and at most |
| 466 | Unordered. This would be checked, for example, by LICM before hoisting |
Eli Friedman | 91a44dd | 2011-08-12 21:50:54 +0000 | [diff] [blame] | 467 | an operation.</li> |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 468 | <li>mayReadFromMemory()/mayWriteToMemory(): Existing predicate, but note |
Eli Friedman | e2d8cf7 | 2011-08-10 20:17:43 +0000 | [diff] [blame] | 469 | that they return true for any operation which is volatile or at least |
Eli Friedman | 91a44dd | 2011-08-12 21:50:54 +0000 | [diff] [blame] | 470 | Monotonic.</li> |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 471 | <li>Alias analysis: Note that AA will return ModRef for anything Acquire or |
Eli Friedman | 91a44dd | 2011-08-12 21:50:54 +0000 | [diff] [blame] | 472 | Release, and for the address accessed by any Monotonic operation.</li> |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 473 | </ul> |
| 474 | |
Eli Friedman | 91a44dd | 2011-08-12 21:50:54 +0000 | [diff] [blame] | 475 | <p>To support optimizing around atomic operations, make sure you are using |
| 476 | the right predicates; everything should work if that is done. If your |
| 477 | pass should optimize some atomic operations (Unordered operations in |
| 478 | particular), make sure it doesn't replace an atomic load or store with |
| 479 | a non-atomic operation.</p> |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 480 | |
| 481 | <p>Some examples of how optimizations interact with various kinds of atomic |
| 482 | operations: |
| 483 | <ul> |
| 484 | <li>memcpyopt: An atomic operation cannot be optimized into part of a |
| 485 | memcpy/memset, including unordered loads/stores. It can pull operations |
| 486 | across some atomic operations. |
| 487 | <li>LICM: Unordered loads/stores can be moved out of a loop. It just treats |
| 488 | monotonic operations like a read+write to a memory location, and anything |
| 489 | stricter than that like a nothrow call. |
| 490 | <li>DSE: Unordered stores can be DSE'ed like normal stores. Monotonic stores |
| 491 | can be DSE'ed in some cases, but it's tricky to reason about, and not |
| 492 | especially important. |
| 493 | <li>Folding a load: Any atomic load from a constant global can be |
| 494 | constant-folded, because it cannot be observed. Similar reasoning allows |
| 495 | scalarrepl with atomic loads and stores. |
| 496 | </ul> |
| 497 | |
| 498 | </div> |
| 499 | |
| 500 | <!-- *********************************************************************** --> |
| 501 | <h2> |
| 502 | <a name="codegen">Atomics and Codegen</a> |
| 503 | </h2> |
| 504 | <!-- *********************************************************************** --> |
| 505 | |
| 506 | <div> |
| 507 | |
| 508 | <p>Atomic operations are represented in the SelectionDAG with |
| 509 | <code>ATOMIC_*</code> opcodes. On architectures which use barrier |
| 510 | instructions for all atomic ordering (like ARM), appropriate fences are |
| 511 | split out as the DAG is built.</p> |
| 512 | |
| 513 | <p>The MachineMemOperand for all atomic operations is currently marked as |
| 514 | volatile; this is not correct in the IR sense of volatile, but CodeGen |
| 515 | handles anything marked volatile very conservatively. This should get |
| 516 | fixed at some point.</p> |
| 517 | |
Eli Friedman | 1bf4ad4 | 2011-08-11 23:44:25 +0000 | [diff] [blame] | 518 | <p>Common architectures have some way of representing at least a pointer-sized |
| 519 | lock-free <code>cmpxchg</code>; such an operation can be used to implement |
| 520 | all the other atomic operations which can be represented in IR up to that |
| 521 | size. Backends are expected to implement all those operations, but not |
| 522 | operations which cannot be implemented in a lock-free manner. It is |
| 523 | expected that backends will give an error when given an operation which |
| 524 | cannot be implemented. (The LLVM code generator is not very helpful here |
| 525 | at the moment, but hopefully that will change.)</p> |
| 526 | |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 527 | <p>The implementation of atomics on LL/SC architectures (like ARM) is currently |
| 528 | a bit of a mess; there is a lot of copy-pasted code across targets, and |
| 529 | the representation is relatively unsuited to optimization (it would be nice |
| 530 | to be able to optimize loops involving cmpxchg etc.).</p> |
| 531 | |
| 532 | <p>On x86, all atomic loads generate a <code>MOV</code>. |
| 533 | SequentiallyConsistent stores generate an <code>XCHG</code>, other stores |
| 534 | generate a <code>MOV</code>. SequentiallyConsistent fences generate an |
| 535 | <code>MFENCE</code>, other fences do not cause any code to be generated. |
| 536 | cmpxchg uses the <code>LOCK CMPXCHG</code> instruction. |
| 537 | <code>atomicrmw xchg</code> uses <code>XCHG</code>, |
| 538 | <code>atomicrmw add</code> and <code>atomicrmw sub</code> use |
| 539 | <code>XADD</code>, and all other <code>atomicrmw</code> operations generate |
| 540 | a loop with <code>LOCK CMPXCHG</code>. Depending on the users of the |
| 541 | result, some <code>atomicrmw</code> operations can be translated into |
| 542 | operations like <code>LOCK AND</code>, but that does not work in |
| 543 | general.</p> |
| 544 | |
| 545 | <p>On ARM, MIPS, and many other RISC architectures, Acquire, Release, and |
| 546 | SequentiallyConsistent semantics require barrier instructions |
| 547 | for every such operation. Loads and stores generate normal instructions. |
Eli Friedman | 1bf4ad4 | 2011-08-11 23:44:25 +0000 | [diff] [blame] | 548 | <code>cmpxchg</code> and <code>atomicrmw</code> can be represented using |
| 549 | a loop with LL/SC-style instructions which take some sort of exclusive |
| 550 | lock on a cache line (<code>LDREX</code> and <code>STREX</code> on |
| 551 | ARM, etc.). At the moment, the IR does not provide any way to represent a |
| 552 | weak <code>cmpxchg</code> which would not require a loop.</p> |
Eli Friedman | 138515d | 2011-08-09 21:07:10 +0000 | [diff] [blame] | 553 | </div> |
| 554 | |
| 555 | <!-- *********************************************************************** --> |
| 556 | |
| 557 | <hr> |
| 558 | <address> |
| 559 | <a href="http://jigsaw.w3.org/css-validator/check/referer"><img |
| 560 | src="http://jigsaw.w3.org/css-validator/images/vcss-blue" alt="Valid CSS"></a> |
| 561 | <a href="http://validator.w3.org/check/referer"><img |
| 562 | src="http://www.w3.org/Icons/valid-html401-blue" alt="Valid HTML 4.01"></a> |
| 563 | |
| 564 | <a href="http://llvm.org/">LLVM Compiler Infrastructure</a><br> |
| 565 | Last modified: $Date: 2011-08-09 02:07:00 -0700 (Tue, 09 Aug 2011) $ |
| 566 | </address> |
| 567 | |
| 568 | </body> |
| 569 | </html> |