Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 1 | Review Checklist for RCU Patches |
| 2 | |
| 3 | |
| 4 | This document contains a checklist for producing and reviewing patches |
| 5 | that make use of RCU. Violating any of the rules listed below will |
| 6 | result in the same sorts of problems that leaving out a locking primitive |
| 7 | would cause. This list is based on experiences reviewing such patches |
| 8 | over a rather long period of time, but improvements are always welcome! |
| 9 | |
| 10 | 0. Is RCU being applied to a read-mostly situation? If the data |
| 11 | structure is updated more than about 10% of the time, then |
| 12 | you should strongly consider some other approach, unless |
| 13 | detailed performance measurements show that RCU is nonetheless |
| 14 | the right tool for the job. |
| 15 | |
| 16 | The other exception would be where performance is not an issue, |
| 17 | and RCU provides a simpler implementation. An example of this |
| 18 | situation is the dynamic NMI code in the Linux 2.6 kernel, |
| 19 | at least on architectures where NMIs are rare. |
| 20 | |
| 21 | 1. Does the update code have proper mutual exclusion? |
| 22 | |
| 23 | RCU does allow -readers- to run (almost) naked, but -writers- must |
| 24 | still use some sort of mutual exclusion, such as: |
| 25 | |
| 26 | a. locking, |
| 27 | b. atomic operations, or |
| 28 | c. restricting updates to a single task. |
| 29 | |
| 30 | If you choose #b, be prepared to describe how you have handled |
| 31 | memory barriers on weakly ordered machines (pretty much all of |
| 32 | them -- even x86 allows reads to be reordered), and be prepared |
| 33 | to explain why this added complexity is worthwhile. If you |
| 34 | choose #c, be prepared to explain how this single task does not |
Paul E. McKenney | a83f1fe | 2005-05-01 08:59:05 -0700 | [diff] [blame] | 35 | become a major bottleneck on big multiprocessor machines (for |
| 36 | example, if the task is updating information relating to itself |
| 37 | that other tasks can read, there by definition can be no |
| 38 | bottleneck). |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 39 | |
| 40 | 2. Do the RCU read-side critical sections make proper use of |
| 41 | rcu_read_lock() and friends? These primitives are needed |
| 42 | to suppress preemption (or bottom halves, in the case of |
| 43 | rcu_read_lock_bh()) in the read-side critical sections, |
| 44 | and are also an excellent aid to readability. |
| 45 | |
Paul E. McKenney | dd81eca | 2005-09-10 00:26:24 -0700 | [diff] [blame] | 46 | As a rough rule of thumb, any dereference of an RCU-protected |
| 47 | pointer must be covered by rcu_read_lock() or rcu_read_lock_bh() |
| 48 | or by the appropriate update-side lock. |
| 49 | |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 50 | 3. Does the update code tolerate concurrent accesses? |
| 51 | |
| 52 | The whole point of RCU is to permit readers to run without |
| 53 | any locks or atomic operations. This means that readers will |
| 54 | be running while updates are in progress. There are a number |
| 55 | of ways to handle this concurrency, depending on the situation: |
| 56 | |
| 57 | a. Make updates appear atomic to readers. For example, |
| 58 | pointer updates to properly aligned fields will appear |
| 59 | atomic, as will individual atomic primitives. Operations |
| 60 | performed under a lock and sequences of multiple atomic |
| 61 | primitives will -not- appear to be atomic. |
| 62 | |
| 63 | This is almost always the best approach. |
| 64 | |
| 65 | b. Carefully order the updates and the reads so that |
| 66 | readers see valid data at all phases of the update. |
| 67 | This is often more difficult than it sounds, especially |
| 68 | given modern CPUs' tendency to reorder memory references. |
| 69 | One must usually liberally sprinkle memory barriers |
| 70 | (smp_wmb(), smp_rmb(), smp_mb()) through the code, |
| 71 | making it difficult to understand and to test. |
| 72 | |
| 73 | It is usually better to group the changing data into |
| 74 | a separate structure, so that the change may be made |
| 75 | to appear atomic by updating a pointer to reference |
| 76 | a new structure containing updated values. |
| 77 | |
| 78 | 4. Weakly ordered CPUs pose special challenges. Almost all CPUs |
| 79 | are weakly ordered -- even i386 CPUs allow reads to be reordered. |
| 80 | RCU code must take all of the following measures to prevent |
| 81 | memory-corruption problems: |
| 82 | |
| 83 | a. Readers must maintain proper ordering of their memory |
| 84 | accesses. The rcu_dereference() primitive ensures that |
| 85 | the CPU picks up the pointer before it picks up the data |
| 86 | that the pointer points to. This really is necessary |
| 87 | on Alpha CPUs. If you don't believe me, see: |
| 88 | |
| 89 | http://www.openvms.compaq.com/wizard/wiz_2637.html |
| 90 | |
| 91 | The rcu_dereference() primitive is also an excellent |
| 92 | documentation aid, letting the person reading the code |
| 93 | know exactly which pointers are protected by RCU. |
| 94 | |
| 95 | The rcu_dereference() primitive is used by the various |
| 96 | "_rcu()" list-traversal primitives, such as the |
Paul E. McKenney | dd81eca | 2005-09-10 00:26:24 -0700 | [diff] [blame] | 97 | list_for_each_entry_rcu(). Note that it is perfectly |
| 98 | legal (if redundant) for update-side code to use |
| 99 | rcu_dereference() and the "_rcu()" list-traversal |
| 100 | primitives. This is particularly useful in code |
| 101 | that is common to readers and updaters. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 102 | |
Paul E. McKenney | a83f1fe | 2005-05-01 08:59:05 -0700 | [diff] [blame] | 103 | b. If the list macros are being used, the list_add_tail_rcu() |
| 104 | and list_add_rcu() primitives must be used in order |
| 105 | to prevent weakly ordered machines from misordering |
| 106 | structure initialization and pointer planting. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 107 | Similarly, if the hlist macros are being used, the |
Paul E. McKenney | a83f1fe | 2005-05-01 08:59:05 -0700 | [diff] [blame] | 108 | hlist_add_head_rcu() primitive is required. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 109 | |
Paul E. McKenney | a83f1fe | 2005-05-01 08:59:05 -0700 | [diff] [blame] | 110 | c. If the list macros are being used, the list_del_rcu() |
| 111 | primitive must be used to keep list_del()'s pointer |
| 112 | poisoning from inflicting toxic effects on concurrent |
| 113 | readers. Similarly, if the hlist macros are being used, |
| 114 | the hlist_del_rcu() primitive is required. |
| 115 | |
| 116 | The list_replace_rcu() primitive may be used to |
| 117 | replace an old structure with a new one in an |
| 118 | RCU-protected list. |
| 119 | |
| 120 | d. Updates must ensure that initialization of a given |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 121 | structure happens before pointers to that structure are |
| 122 | publicized. Use the rcu_assign_pointer() primitive |
| 123 | when publicizing a pointer to a structure that can |
| 124 | be traversed by an RCU read-side critical section. |
| 125 | |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 126 | 5. If call_rcu(), or a related primitive such as call_rcu_bh(), |
| 127 | is used, the callback function must be written to be called |
| 128 | from softirq context. In particular, it cannot block. |
| 129 | |
Paul E. McKenney | a83f1fe | 2005-05-01 08:59:05 -0700 | [diff] [blame] | 130 | 6. Since synchronize_rcu() can block, it cannot be called from |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 131 | any sort of irq context. |
| 132 | |
| 133 | 7. If the updater uses call_rcu(), then the corresponding readers |
| 134 | must use rcu_read_lock() and rcu_read_unlock(). If the updater |
| 135 | uses call_rcu_bh(), then the corresponding readers must use |
| 136 | rcu_read_lock_bh() and rcu_read_unlock_bh(). Mixing things up |
| 137 | will result in confusion and broken kernels. |
| 138 | |
| 139 | One exception to this rule: rcu_read_lock() and rcu_read_unlock() |
| 140 | may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh() |
| 141 | in cases where local bottom halves are already known to be |
| 142 | disabled, for example, in irq or softirq context. Commenting |
| 143 | such cases is a must, of course! And the jury is still out on |
| 144 | whether the increased speed is worth it. |
| 145 | |
Paul E. McKenney | a83f1fe | 2005-05-01 08:59:05 -0700 | [diff] [blame] | 146 | 8. Although synchronize_rcu() is a bit slower than is call_rcu(), |
Paul E. McKenney | 165d6c7 | 2006-06-25 05:48:44 -0700 | [diff] [blame] | 147 | it usually results in simpler code. So, unless update |
| 148 | performance is critically important or the updaters cannot block, |
| 149 | synchronize_rcu() should be used in preference to call_rcu(). |
| 150 | |
| 151 | An especially important property of the synchronize_rcu() |
| 152 | primitive is that it automatically self-limits: if grace periods |
| 153 | are delayed for whatever reason, then the synchronize_rcu() |
| 154 | primitive will correspondingly delay updates. In contrast, |
| 155 | code using call_rcu() should explicitly limit update rate in |
| 156 | cases where grace periods are delayed, as failing to do so can |
| 157 | result in excessive realtime latencies or even OOM conditions. |
| 158 | |
| 159 | Ways of gaining this self-limiting property when using call_rcu() |
| 160 | include: |
| 161 | |
| 162 | a. Keeping a count of the number of data-structure elements |
| 163 | used by the RCU-protected data structure, including those |
| 164 | waiting for a grace period to elapse. Enforce a limit |
| 165 | on this number, stalling updates as needed to allow |
| 166 | previously deferred frees to complete. |
| 167 | |
| 168 | Alternatively, limit only the number awaiting deferred |
| 169 | free rather than the total number of elements. |
| 170 | |
| 171 | b. Limiting update rate. For example, if updates occur only |
| 172 | once per hour, then no explicit rate limiting is required, |
| 173 | unless your system is already badly broken. The dcache |
| 174 | subsystem takes this approach -- updates are guarded |
| 175 | by a global lock, limiting their rate. |
| 176 | |
| 177 | c. Trusted update -- if updates can only be done manually by |
| 178 | superuser or some other trusted user, then it might not |
| 179 | be necessary to automatically limit them. The theory |
| 180 | here is that superuser already has lots of ways to crash |
| 181 | the machine. |
| 182 | |
| 183 | d. Use call_rcu_bh() rather than call_rcu(), in order to take |
| 184 | advantage of call_rcu_bh()'s faster grace periods. |
| 185 | |
| 186 | e. Periodically invoke synchronize_rcu(), permitting a limited |
| 187 | number of updates per grace period. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 188 | |
| 189 | 9. All RCU list-traversal primitives, which include |
| 190 | list_for_each_rcu(), list_for_each_entry_rcu(), |
| 191 | list_for_each_continue_rcu(), and list_for_each_safe_rcu(), |
| 192 | must be within an RCU read-side critical section. RCU |
| 193 | read-side critical sections are delimited by rcu_read_lock() |
| 194 | and rcu_read_unlock(), or by similar primitives such as |
| 195 | rcu_read_lock_bh() and rcu_read_unlock_bh(). |
| 196 | |
| 197 | Use of the _rcu() list-traversal primitives outside of an |
| 198 | RCU read-side critical section causes no harm other than |
Paul E. McKenney | dd81eca | 2005-09-10 00:26:24 -0700 | [diff] [blame] | 199 | a slight performance degradation on Alpha CPUs. It can |
| 200 | also be quite helpful in reducing code bloat when common |
| 201 | code is shared between readers and updaters. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 202 | |
| 203 | 10. Conversely, if you are in an RCU read-side critical section, |
| 204 | you -must- use the "_rcu()" variants of the list macros. |
| 205 | Failing to do so will break Alpha and confuse people reading |
| 206 | your code. |
Paul E. McKenney | a83f1fe | 2005-05-01 08:59:05 -0700 | [diff] [blame] | 207 | |
| 208 | 11. Note that synchronize_rcu() -only- guarantees to wait until |
| 209 | all currently executing rcu_read_lock()-protected RCU read-side |
| 210 | critical sections complete. It does -not- necessarily guarantee |
| 211 | that all currently running interrupts, NMIs, preempt_disable() |
| 212 | code, or idle loops will complete. Therefore, if you do not have |
| 213 | rcu_read_lock()-protected read-side critical sections, do -not- |
| 214 | use synchronize_rcu(). |
| 215 | |
| 216 | If you want to wait for some of these other things, you might |
| 217 | instead need to use synchronize_irq() or synchronize_sched(). |
Paul E. McKenney | d19720a | 2006-02-01 03:06:42 -0800 | [diff] [blame] | 218 | |
| 219 | 12. Any lock acquired by an RCU callback must be acquired elsewhere |
| 220 | with irq disabled, e.g., via spin_lock_irqsave(). Failing to |
| 221 | disable irq on a given acquisition of that lock will result in |
| 222 | deadlock as soon as the RCU callback happens to interrupt that |
| 223 | acquisition's critical section. |
Paul E. McKenney | 621934e | 2006-10-04 02:17:02 -0700 | [diff] [blame] | 224 | |
Paul E. McKenney | ef48bd2 | 2007-07-15 23:41:03 -0700 | [diff] [blame] | 225 | 13. RCU callbacks can be and are executed in parallel. In many cases, |
| 226 | the callback code simply wrappers around kfree(), so that this |
| 227 | is not an issue (or, more accurately, to the extent that it is |
| 228 | an issue, the memory-allocator locking handles it). However, |
| 229 | if the callbacks do manipulate a shared data structure, they |
| 230 | must use whatever locking or other synchronization is required |
| 231 | to safely access and/or modify that data structure. |
| 232 | |
| 233 | 14. SRCU (srcu_read_lock(), srcu_read_unlock(), and synchronize_srcu()) |
Paul E. McKenney | 621934e | 2006-10-04 02:17:02 -0700 | [diff] [blame] | 234 | may only be invoked from process context. Unlike other forms of |
| 235 | RCU, it -is- permissible to block in an SRCU read-side critical |
| 236 | section (demarked by srcu_read_lock() and srcu_read_unlock()), |
| 237 | hence the "SRCU": "sleepable RCU". Please note that if you |
| 238 | don't need to sleep in read-side critical sections, you should |
| 239 | be using RCU rather than SRCU, because RCU is almost always |
| 240 | faster and easier to use than is SRCU. |
| 241 | |
| 242 | Also unlike other forms of RCU, explicit initialization |
| 243 | and cleanup is required via init_srcu_struct() and |
| 244 | cleanup_srcu_struct(). These are passed a "struct srcu_struct" |
| 245 | that defines the scope of a given SRCU domain. Once initialized, |
| 246 | the srcu_struct is passed to srcu_read_lock(), srcu_read_unlock() |
| 247 | and synchronize_srcu(). A given synchronize_srcu() waits only |
| 248 | for SRCU read-side critical sections governed by srcu_read_lock() |
| 249 | and srcu_read_unlock() calls that have been passd the same |
| 250 | srcu_struct. This property is what makes sleeping read-side |
| 251 | critical sections tolerable -- a given subsystem delays only |
| 252 | its own updates, not those of other subsystems using SRCU. |
| 253 | Therefore, SRCU is less prone to OOM the system than RCU would |
| 254 | be if RCU's read-side critical sections were permitted to |
| 255 | sleep. |
| 256 | |
| 257 | The ability to sleep in read-side critical sections does not |
| 258 | come for free. First, corresponding srcu_read_lock() and |
| 259 | srcu_read_unlock() calls must be passed the same srcu_struct. |
| 260 | Second, grace-period-detection overhead is amortized only |
| 261 | over those updates sharing a given srcu_struct, rather than |
| 262 | being globally amortized as they are for other forms of RCU. |
| 263 | Therefore, SRCU should be used in preference to rw_semaphore |
| 264 | only in extremely read-intensive situations, or in situations |
| 265 | requiring SRCU's read-side deadlock immunity or low read-side |
| 266 | realtime latency. |
| 267 | |
| 268 | Note that, rcu_assign_pointer() and rcu_dereference() relate to |
| 269 | SRCU just as they do to other forms of RCU. |