<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
          "http://www.w3.org/TR/html4/strict.dtd">
<!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ -->
<html>
<head>
  <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  <title>&lt;atomic&gt; design</title>
  <link type="text/css" rel="stylesheet" href="menu.css">
  <link type="text/css" rel="stylesheet" href="content.css">
</head>

<body>
<div id="menu">
  <div>
    <a href="http://llvm.org/">LLVM Home</a>
  </div>

  <div class="submenu">
    <label>libc++ Info</label>
    <a href="/index.html">About</a>
  </div>

  <div class="submenu">
    <label>Quick Links</label>
    <a href="http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev">cfe-dev</a>
    <a href="http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits">cfe-commits</a>
    <a href="http://llvm.org/bugs/">Bug Reports</a>
    <a href="http://llvm.org/svn/llvm-project/libcxx/trunk/">Browse SVN</a>
    <a href="http://llvm.org/viewvc/llvm-project/libcxx/trunk/">Browse ViewVC</a>
  </div>
</div>

<div id="content">
<!--*********************************************************************-->
<h1>&lt;atomic&gt; design</h1>
<!--*********************************************************************-->

<p>
The <tt>&lt;atomic&gt;</tt> header is one of the headers most closely coupled
to the compiler. Ideally, invoking any function from <tt>&lt;atomic&gt;</tt>
should result in highly optimized assembly being inserted directly into your
application ... assembly that is not otherwise representable by higher-level C
or C++ expressions. The design of the libc++ <tt>&lt;atomic&gt;</tt> header
started with this goal in mind. A secondary but still very important goal is
that the compiler should have to do minimal work to facilitate the
implementation of <tt>&lt;atomic&gt;</tt>. Without this second goal,
practically speaking, the libc++ <tt>&lt;atomic&gt;</tt> header would be
doomed to be a barely supported, second-class citizen on almost every
platform.
</p>

<p>Goals:</p>

<blockquote><ul>
<li>Optimal code generation for atomic operations</li>
<li>Minimal effort for the compiler to achieve goal 1 on any given platform</li>
<li>Conformance to the C++0X draft standard</li>
</ul></blockquote>

<p>
The purpose of this document is to inform compiler writers of what they need to
do to enable a high-performance libc++ <tt>&lt;atomic&gt;</tt> with minimal
effort.
</p>

<h2>The minimal work that must be done for a conforming <tt>&lt;atomic&gt;</tt></h2>

<p>
The only "atomic" operations that must actually be lock-free in
<tt>&lt;atomic&gt;</tt> are represented by the following compiler intrinsics:
</p>

<blockquote><pre>
__atomic_flag__
__atomic_exchange_seq_cst(__atomic_flag__ volatile* obj, __atomic_flag__ desr)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    __atomic_flag__ result = *obj;
    *obj = desr;
    return result;
}

void
__atomic_store_seq_cst(__atomic_flag__ volatile* obj, __atomic_flag__ desr)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    *obj = desr;
}
</pre></blockquote>

<p>
Where:
</p>

<blockquote><ul>
<li>
If <tt>__has_feature(__atomic_flag)</tt> evaluates to 1 in the preprocessor,
then the compiler must define <tt>__atomic_flag__</tt> (e.g. as a typedef to
<tt>int</tt>).
</li>
<li>
If <tt>__has_feature(__atomic_flag)</tt> evaluates to 0 in the preprocessor,
then the library defines <tt>__atomic_flag__</tt> as a typedef to <tt>bool</tt>.
</li>
<li>
<p>
To communicate that the above intrinsics are available, the compiler must
arrange for <tt>__has_feature</tt> to return 1 when fed the intrinsic name
appended with an '_' and the mangled type name of <tt>__atomic_flag__</tt>.
</p>
<p>
For example, if <tt>__atomic_flag__</tt> is <tt>unsigned int</tt>:
</p>
<blockquote><pre>
__has_feature(__atomic_flag) == 1
__has_feature(__atomic_exchange_seq_cst_j) == 1
__has_feature(__atomic_store_seq_cst_j) == 1

typedef unsigned int __atomic_flag__;

unsigned int __atomic_exchange_seq_cst(unsigned int volatile*, unsigned int)
{
    // ...
}

void __atomic_store_seq_cst(unsigned int volatile*, unsigned int)
{
    // ...
}
</pre></blockquote>
</li>
</ul></blockquote>

<p>
That's it! Compiler writers do the above and you've got a fully conforming
(though sub-par-performing) <tt>&lt;atomic&gt;</tt> header!
</p>

<h2>Recommended work for a higher-performance <tt>&lt;atomic&gt;</tt></h2>

<p>
It would be good if the above intrinsics worked with all integral types plus
<tt>void*</tt>. Because this may not be possible to do in a lock-free manner for
all integral types on all platforms, a compiler must communicate each type that
an intrinsic works with. For example, if <tt>__atomic_exchange_seq_cst</tt>
works for all types except <tt>long long</tt> and
<tt>unsigned long long</tt>, then:
</p>

<blockquote><pre>
__has_feature(__atomic_exchange_seq_cst_b) == 1   // bool
__has_feature(__atomic_exchange_seq_cst_c) == 1   // char
__has_feature(__atomic_exchange_seq_cst_a) == 1   // signed char
__has_feature(__atomic_exchange_seq_cst_h) == 1   // unsigned char
__has_feature(__atomic_exchange_seq_cst_Ds) == 1  // char16_t
__has_feature(__atomic_exchange_seq_cst_Di) == 1  // char32_t
__has_feature(__atomic_exchange_seq_cst_w) == 1   // wchar_t
__has_feature(__atomic_exchange_seq_cst_s) == 1   // short
__has_feature(__atomic_exchange_seq_cst_t) == 1   // unsigned short
__has_feature(__atomic_exchange_seq_cst_i) == 1   // int
__has_feature(__atomic_exchange_seq_cst_j) == 1   // unsigned int
__has_feature(__atomic_exchange_seq_cst_l) == 1   // long
__has_feature(__atomic_exchange_seq_cst_m) == 1   // unsigned long
__has_feature(__atomic_exchange_seq_cst_Pv) == 1  // void*
</pre></blockquote>

<p>
Note that only the <tt>__has_feature</tt> flag is decorated with the argument
type. The name of the compiler intrinsic is not decorated, but instead works
like a C++ overloaded function.
</p>
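
<p>
For example, a compiler providing the exchange intrinsic for both <tt>int</tt>
and <tt>unsigned int</tt> would announce each type separately but declare a
single overloaded name (an illustrative sketch of the scheme just described):
</p>

<blockquote><pre>
__has_feature(__atomic_exchange_seq_cst_i) == 1  // int
__has_feature(__atomic_exchange_seq_cst_j) == 1  // unsigned int

int          __atomic_exchange_seq_cst(int volatile* obj, int desr);
unsigned int __atomic_exchange_seq_cst(unsigned int volatile* obj, unsigned int desr);
</pre></blockquote>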

<p>
Additionally, there are other intrinsics besides
<tt>__atomic_exchange_seq_cst</tt> and <tt>__atomic_store_seq_cst</tt>. They
are optional, but if the compiler can generate faster code than the library
provides, then clients will benefit from the compiler writer's expertise and
knowledge of the targeted platform.
</p>

<p>
Below is the complete list of <i>sequentially consistent</i> intrinsics and
their library implementations. Template syntax is used to indicate the desired
overloading for integral and <tt>void*</tt> types. The template does not
represent a requirement that the intrinsic operate on <em>any</em> type!
</p>

<blockquote><pre>
T is one of: bool, char, signed char, unsigned char, short, unsigned short,
             int, unsigned int, long, unsigned long,
             long long, unsigned long long, char16_t, char32_t, wchar_t, void*

template &lt;class T&gt;
T
__atomic_load_seq_cst(T const volatile* obj)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    return *obj;
}

template &lt;class T&gt;
void
__atomic_store_seq_cst(T volatile* obj, T desr)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    *obj = desr;
}

template &lt;class T&gt;
T
__atomic_exchange_seq_cst(T volatile* obj, T desr)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    T r = *obj;
    *obj = desr;
    return r;
}

template &lt;class T&gt;
bool
__atomic_compare_exchange_strong_seq_cst_seq_cst(T volatile* obj, T* exp, T desr)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    if (std::memcmp(const_cast&lt;T*&gt;(obj), exp, sizeof(T)) == 0)
    {
        std::memcpy(const_cast&lt;T*&gt;(obj), &amp;desr, sizeof(T));
        return true;
    }
    std::memcpy(exp, const_cast&lt;T*&gt;(obj), sizeof(T));
    return false;
}

template &lt;class T&gt;
bool
__atomic_compare_exchange_weak_seq_cst_seq_cst(T volatile* obj, T* exp, T desr)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    if (std::memcmp(const_cast&lt;T*&gt;(obj), exp, sizeof(T)) == 0)
    {
        std::memcpy(const_cast&lt;T*&gt;(obj), &amp;desr, sizeof(T));
        return true;
    }
    std::memcpy(exp, const_cast&lt;T*&gt;(obj), sizeof(T));
    return false;
}

T is one of: char, signed char, unsigned char, short, unsigned short,
             int, unsigned int, long, unsigned long,
             long long, unsigned long long, char16_t, char32_t, wchar_t

template &lt;class T&gt;
T
__atomic_fetch_add_seq_cst(T volatile* obj, T operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    T r = *obj;
    *obj += operand;
    return r;
}

template &lt;class T&gt;
T
__atomic_fetch_sub_seq_cst(T volatile* obj, T operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    T r = *obj;
    *obj -= operand;
    return r;
}

template &lt;class T&gt;
T
__atomic_fetch_and_seq_cst(T volatile* obj, T operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    T r = *obj;
    *obj &amp;= operand;
    return r;
}

template &lt;class T&gt;
T
__atomic_fetch_or_seq_cst(T volatile* obj, T operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    T r = *obj;
    *obj |= operand;
    return r;
}

template &lt;class T&gt;
T
__atomic_fetch_xor_seq_cst(T volatile* obj, T operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    T r = *obj;
    *obj ^= operand;
    return r;
}

void*
__atomic_fetch_add_seq_cst(void* volatile* obj, ptrdiff_t operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    void* r = *obj;
    (char*&amp;)(*obj) += operand;
    return r;
}

void*
__atomic_fetch_sub_seq_cst(void* volatile* obj, ptrdiff_t operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    void* r = *obj;
    (char*&amp;)(*obj) -= operand;
    return r;
}

void __atomic_thread_fence_seq_cst()
{
    unique_lock&lt;mutex&gt; _(some_mutex);
}

void __atomic_signal_fence_seq_cst()
{
    unique_lock&lt;mutex&gt; _(some_mutex);
}
</pre></blockquote>

<p>
One should consult the (currently draft)
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3126.pdf">C++ standard</a>
for the detailed definitions of these operations. For example,
<tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt> is allowed to fail
spuriously while <tt>__atomic_compare_exchange_strong_seq_cst_seq_cst</tt> is
not.
</p>
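
<p>
The practical consequence is that the weak form is meant for retry loops, where
a spurious failure merely costs one more iteration. A minimal sketch of
hypothetical client code (assuming the <tt>int</tt> overload of the intrinsic
is available):
</p>

<blockquote><pre>
// Atomically increment *obj with the weak compare-exchange. On failure
// (spurious or real), 'expected' is refreshed from *obj and the loop retries.
void atomic_increment(int volatile* obj)
{
    int expected = *obj;
    while (!__atomic_compare_exchange_weak_seq_cst_seq_cst(obj, &amp;expected, expected + 1))
        ;
}
</pre></blockquote>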

<p>
If on your platform the lock-free definition of
<tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt> would be the same as
<tt>__atomic_compare_exchange_strong_seq_cst_seq_cst</tt>, you may omit the
<tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt> intrinsic without a
performance cost. The library will prefer your implementation of
<tt>__atomic_compare_exchange_strong_seq_cst_seq_cst</tt> over its own
definition for implementing
<tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt>. That is, the library
will arrange for <tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt> to call
<tt>__atomic_compare_exchange_strong_seq_cst_seq_cst</tt> if you supply an
intrinsic for the strong version but not the weak.
</p>
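
<p>
A minimal sketch of how this fallback could look (for illustration only; not
the actual libc++ source), shown for <tt>int</tt>:
</p>

<blockquote><pre>
#if !__has_feature(__atomic_compare_exchange_weak_seq_cst_seq_cst_i) &amp;&amp; \
     __has_feature(__atomic_compare_exchange_strong_seq_cst_seq_cst_i)
// No weak intrinsic announced: implement the weak name with the strong
// intrinsic, which is always at least as good as the weak form.
inline bool
__atomic_compare_exchange_weak_seq_cst_seq_cst(int volatile* obj, int* exp, int desr)
{
    return __atomic_compare_exchange_strong_seq_cst_seq_cst(obj, exp, desr);
}
#endif
</pre></blockquote>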

<h2>Taking advantage of weaker memory synchronization</h2>

<p>
So far all of the intrinsics presented require a <em>sequentially
consistent</em> memory ordering. That is, no loads or stores can move across
the operation (just as if the library had locked that internal mutex). But
<tt>&lt;atomic&gt;</tt> supports weaker memory ordering operations. In all,
there are six memory orderings (listed here from strongest to weakest):
</p>

<blockquote><pre>
memory_order_seq_cst
memory_order_acq_rel
memory_order_release
memory_order_acquire
memory_order_consume
memory_order_relaxed
</pre></blockquote>

<p>
(See the
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3126.pdf">C++ standard</a>
for the detailed definitions of each of these orderings.)
</p>

<p>
On some platforms, the compiler vendor can offer some or even all of the above
intrinsics at one or more weaker levels of memory synchronization. This might
mean, for example, not issuing an <tt>mfence</tt> instruction on x86.
</p>
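
<p>
For instance, on x86 a sequentially consistent store needs a trailing fence,
while a release store does not (an illustrative sketch; the exact code
generation is the vendor's choice):
</p>

<blockquote><pre>
void __atomic_store_seq_cst(int volatile* obj, int desr);  // mov, then mfence
void __atomic_store_release(int volatile* obj, int desr);  // a plain mov suffices
</pre></blockquote>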

<p>
If the compiler does not offer a given operation at a given memory ordering
level, the library will automatically attempt to call the same operation at the
next strongest memory ordering. This continues up to <tt>seq_cst</tt>, and if
that doesn't exist either, the library takes over and does the job with a
<tt>mutex</tt>. This is a compile-time search &amp; selection operation. At run
time, the application will see only the few inlined assembly instructions for
the selected intrinsic.
</p>
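
<p>
A minimal sketch of that selection for an <tt>acquire</tt> load of <tt>int</tt>
(a hypothetical helper for illustration; not the actual libc++ source):
</p>

<blockquote><pre>
inline int __choose_load_acquire(int const volatile* obj)
{
#if __has_feature(__atomic_load_acquire_i)
    return __atomic_load_acquire(obj);       // exact match
#elif __has_feature(__atomic_load_seq_cst_i)
    return __atomic_load_seq_cst(obj);       // next strongest: still correct
#else
    unique_lock&lt;mutex&gt; _(some_mutex);        // library mutex fallback
    return *obj;
#endif
}
</pre></blockquote>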

<p>
Each intrinsic is appended with the 7-letter name of the memory ordering it
addresses. For example, a <tt>load</tt> with <tt>relaxed</tt> ordering is
defined by:
</p>

<blockquote><pre>
T __atomic_load_relaxed(const volatile T* obj);
</pre></blockquote>

<p>
And announced with:
</p>

<blockquote><pre>
__has_feature(__atomic_load_relaxed_b) == 1  // bool
__has_feature(__atomic_load_relaxed_c) == 1  // char
__has_feature(__atomic_load_relaxed_a) == 1  // signed char
...
</pre></blockquote>

<p>
The <tt>__atomic_compare_exchange_strong</tt> and
<tt>__atomic_compare_exchange_weak</tt> intrinsics are parameterized on two
memory orderings. The first ordering applies when the operation returns
<tt>true</tt> and the second ordering applies when the operation returns
<tt>false</tt>.
</p>
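
<p>
For example, a strong compare-exchange with <tt>acquire</tt> ordering on success
and <tt>relaxed</tt> ordering on failure follows the naming scheme above:
</p>

<blockquote><pre>
__has_feature(__atomic_compare_exchange_strong_acquire_relaxed_i) == 1  // int

template &lt;class T&gt;
bool
__atomic_compare_exchange_strong_acquire_relaxed(T volatile* obj, T* exp, T desr);
</pre></blockquote>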

<p>
Not every memory ordering is appropriate for every operation. <tt>exchange</tt>
and the <tt>fetch_<i>op</i></tt> operations support all six. But <tt>load</tt>
supports only <tt>relaxed</tt>, <tt>consume</tt>, <tt>acquire</tt>, and
<tt>seq_cst</tt>, while <tt>store</tt> supports only <tt>relaxed</tt>,
<tt>release</tt>, and <tt>seq_cst</tt>. The <tt>compare_exchange</tt>
operations support the following 16 combinations out of the possible 36:
</p>

<blockquote><pre>
relaxed_relaxed
consume_relaxed
consume_consume
acquire_relaxed
acquire_consume
acquire_acquire
release_relaxed
release_consume
release_acquire
acq_rel_relaxed
acq_rel_consume
acq_rel_acquire
seq_cst_relaxed
seq_cst_consume
seq_cst_acquire
seq_cst_seq_cst
</pre></blockquote>

<p>
Again, the compiler supplies intrinsics only for the strongest orderings where
it can make a difference. The library takes care of calling the weakest
supplied intrinsic that is at least as strong as the ordering the customer
asked for.
</p>

</div>
</body>
</html>